Discussion:
[Lguest] [PATCH 0/2] x86/entry/32: Get rid of paravirt_enabled in ESPFIX
Andy Lutomirski
2016-02-29 23:45:22 UTC
Hi Luis-

As promised, here are the patches.

Borislav, if you're okay with this (ab)use of the cpufeatures stuff
and if they survive review, I'd be okay with them joining your
series or going straight into tip:x86/asm.

Andy Lutomirski (2):
x86/nmi/64: Giant debugging hack
x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of
paravirt_enabled

arch/x86/entry/entry_32.S | 15 ++-------------
arch/x86/entry/entry_64.S | 34 ++++++++++++++++++++++++++++++----
arch/x86/include/asm/cpufeatures.h | 8 ++++++++
arch/x86/include/asm/segment.h | 5 ++++-
arch/x86/kernel/cpu/common.c | 27 +++++++++++++++++++++++++++
5 files changed, 71 insertions(+), 18 deletions(-)
--
2.5.0
Andy Lutomirski
2016-02-29 23:45:23 UTC
x86_64 has very clean espfix handling on paravirt: espfix64 is set
up in native_iret, so paravirt systems that override iret bypass
espfix64 automatically. This is robust and straightforward.

x86_32 is messier. espfix is set up before the IRET paravirt patch
point, so it can't be directly conditionalized on whether we use
native_iret. We also can't easily move it into native_iret without
regressing performance due to a bizarre consideration. Specifically,
on 64-bit kernels, the logic is:

        if (regs->ss & 0x4)
                setup_espfix;

On 32-bit kernels, the logic is:

        if ((regs->ss & 0x4) && (regs->cs & 0x3) == 3 &&
            (regs->flags & X86_EFLAGS_VM) == 0)
                setup_espfix;
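
As a rough C rendering of the two checks (the struct and helper names below are purely illustrative, not kernel code): bit 2 of SS is the TI bit, i.e. the return stack selector lives in the LDT, and the extra 32-bit tests rule out kernel-CPL and vm86 returns.

        #include <stdbool.h>
        #include <stdint.h>

        #define X86_EFLAGS_VM 0x00020000        /* EFLAGS.VM: virtual-8086 mode */

        struct sample_regs { uint16_t ss, cs; uint32_t flags; };

        /* 64-bit check: SS bit 2 (the TI bit) set, i.e. stack selector in the LDT. */
        static bool needs_espfix_64(const struct sample_regs *regs)
        {
                return regs->ss & 0x4;
        }

        /* 32-bit check: additionally require a return to CPL 3 outside vm86 mode. */
        static bool needs_espfix_32(const struct sample_regs *regs)
        {
                return (regs->ss & 0x4) &&
                       (regs->cs & 0x3) == 3 &&
                       !(regs->flags & X86_EFLAGS_VM);
        }

        int main(void)
        {
                /* e.g. a 16-bit LDT stack selector with RPL 3 and a ring-3 CS */
                struct sample_regs regs = { .ss = 0x0f, .cs = 0x23, .flags = 0 };
                return !(needs_espfix_64(&regs) && needs_espfix_32(&regs));
        }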

The performance of setup_espfix itself is essentially irrelevant, but
the comparison happens on every IRET so its performance matters. On
x86_64, there's no need for any registers except flags to implement
the comparison, so we fold the whole thing into native_iret. On
x86_32, we don't do that because we need a free register to
implement the comparison efficiently. We therefore do espfix setup
before restoring registers on x86_32.

This patch gets rid of the explicit paravirt_enabled check by
introducing X86_BUG_ESPFIX on 32-bit systems and using an ALTERNATIVE
to skip espfix on paravirt systems where iret != native_iret. This is
also messy, but it's at least in line with other things we do.
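
Roughly, ALTERNATIVE "old", "new", feature emits the first sequence at build time and lets apply_alternatives() swap in the second on CPUs that have the feature bit; X86_BUG bits share the feature-word space, so they can gate it the same way. A self-contained C model of the resulting behavior (purely illustrative: the real gate is a one-time boot patch rather than a runtime branch, and the function names merely mirror the entry_32.S labels):

        #include <stdbool.h>
        #include <stdio.h>

        /* Stand-ins for the entry_32.S labels; purely illustrative. */
        static void restore_nocheck(void) { puts("plain IRET, espfix skipped"); }
        static void espfix_checks(void)   { puts("run the SS/CS/EFLAGS checks"); }

        /* What the patched site at restore_all_notrace effectively does. */
        static void restore_all_notrace(bool cpu_has_espfix_bug)
        {
                if (!cpu_has_espfix_bug) {      /* paravirt, iret != native_iret */
                        restore_nocheck();      /* the "jmp restore_nocheck" stays */
                        return;
                }
                espfix_checks();                /* jmp patched out: fall through */
        }

        int main(void)
        {
                restore_all_notrace(true);      /* native CPU: X86_BUG_ESPFIX set */
                restore_all_notrace(false);     /* paravirt guest */
                return 0;
        }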

This improves espfix performance by removing a branch, but no one
cares. More importantly, it removes a paravirt_enabled user, which is
good because paravirt_enabled is ill-defined and is going away.

Signed-off-by: Andy Lutomirski <***@kernel.org>
---
arch/x86/entry/entry_32.S | 15 ++-------------
arch/x86/include/asm/cpufeatures.h | 8 ++++++++
arch/x86/kernel/cpu/common.c | 25 +++++++++++++++++++++++++
3 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 6a792603f9f3..4f7615e405a4 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -418,6 +418,8 @@ restore_all:
TRACE_IRQS_IRET
restore_all_notrace:
#ifdef CONFIG_X86_ESPFIX32
+ ALTERNATIVE "jmp restore_nocheck", "", X86_BUG_ESPFIX
+
movl PT_EFLAGS(%esp), %eax # mix EFLAGS, SS and CS
/*
* Warning: PT_OLDSS(%esp) contains the wrong/random values if we
@@ -444,19 +446,6 @@ ENTRY(iret_exc )

#ifdef CONFIG_X86_ESPFIX32
ldt_ss:
-#ifdef CONFIG_PARAVIRT
- /*
- * The kernel can't run on a non-flat stack if paravirt mode
- * is active. Rather than try to fixup the high bits of
- * ESP, bypass this code entirely. This may break DOSemu
- * and/or Wine support in a paravirt VM, although the option
- * is still available to implement the setting of the high
- * 16-bits in the INTERRUPT_RETURN paravirt-op.
- */
- cmpl $0, pv_info+PARAVIRT_enabled
- jne restore_nocheck
-#endif
-
/*
* Setup and switch to ESPFIX stack
*
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6663fae71b12..d11a3aaafd96 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -286,4 +286,12 @@
#define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
#define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */

+#ifdef CONFIG_X86_32
+/*
+ * 64-bit kernels don't use X86_BUG_ESPFIX. Make the define conditional
+ * to avoid confusion.
+ */
+#define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */
+#endif
+
#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 91ddae732a36..f398dc3f6ad4 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -979,6 +979,31 @@ static void identify_cpu(struct cpuinfo_x86 *c)
#ifdef CONFIG_NUMA
numa_add_cpu(smp_processor_id());
#endif
+
+ /*
+ * ESPFIX is a strange bug. All real CPUs have it. Paravirt
+ * systems that run Linux at CPL > 0 may or may not have the
+ * issue, but, even if they have the issue, there's absolutely
+ * nothing we can do about it because we can't use the real IRET
+ * instruction.
+ *
+ * NB: For the time being, only 32-bit kernels support
+ * X86_BUG_ESPFIX as such. 64-bit kernels directly choose
+ * whether to apply espfix using paravirt hooks. If any
+ * non-paravirt system ever shows up that does *not* have the
+ * ESPFIX issue, we can change this.
+ */
+#ifdef CONFIG_X86_32
+#ifdef CONFIG_PARAVIRT
+ do {
+ extern void native_iret(void);
+ if (pv_cpu_ops.iret == native_iret)
+ set_cpu_bug(c, X86_BUG_ESPFIX);
+ } while (0);
+#else
+ set_cpu_bug(c, X86_BUG_ESPFIX);
+#endif
+#endif
}

/*
--
2.5.0
Andy Lutomirski
2016-02-29 23:45:24 UTC
Signed-off-by: Andy Lutomirski <***@kernel.org>
---
arch/x86/entry/entry_64.S | 34 ++++++++++++++++++++++++++++++----
arch/x86/include/asm/segment.h | 5 ++++-
arch/x86/kernel/cpu/common.c | 2 ++
3 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 70eadb0ea5fa..2f61059dacc3 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -37,6 +37,9 @@
#include <asm/pgtable_types.h>
#include <linux/err.h>

+#define NMI_NORECURSE_SS (GDT_ENTRY_FOO << 3)
+#define NMI_RECURSIVE_SS (GDT_ENTRY_BAR << 3)
+
/* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
#include <linux/elf-em.h>
#define AUDIT_ARCH_X86_64 (EM_X86_64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
@@ -523,6 +526,10 @@ restore_c_regs_and_iret:
INTERRUPT_RETURN

ENTRY(native_iret)
+ cmp $NMI_NORECURSE_SS, 4*8(%rsp)
+ jne 1f
+ ud2
+1:
/*
* Are we returning to a stack segment from the LDT? Note: in
* 64-bit mode SS:RSP on the exception stack is always valid.
@@ -1113,6 +1120,13 @@ ENTRY(nmi)
/* Use %rdx as our temp variable throughout */
pushq %rdx

+ cmpl $NMI_NORECURSE_SS, 4*8(%rsp)
+ jne 1f
+ ud2
+1:
+ movl $NMI_NORECURSE_SS, %edx
+ mov %dx, %ss
+
testb $3, CS-RIP+8(%rsp)
jz .Lnmi_from_kernel

@@ -1158,6 +1172,8 @@ ENTRY(nmi)
* due to nesting -- we're on the normal thread stack and we're
* done with the NMI stack.
*/
+ mov $__KERNEL_DS, %edx
+ mov %dx, %ss

movq %rsp, %rdi
movq $-1, %rsi
@@ -1240,6 +1256,11 @@ ENTRY(nmi)
cmpl $1, -8(%rsp)
je nested_nmi

+ cmpl $0, -8(%rsp)
+ je 1f
+ ud2 /* corrupt */
+1:
+
/*
* Now test if the previous stack was an NMI stack. This covers
* the case where we interrupt an outer NMI after it clears
@@ -1277,7 +1298,7 @@ nested_nmi:
*/
subq $8, %rsp
leaq -10*8(%rsp), %rdx
- pushq $__KERNEL_DS
+ pushq $NMI_RECURSIVE_SS
pushq %rdx
pushfq
pushq $__KERNEL_CS
@@ -1293,6 +1314,11 @@ nested_nmi_out:
INTERRUPT_RETURN

first_nmi:
+ cmpl $NMI_RECURSIVE_SS, 4*8(%rsp)
+ jne 1f
+ ud2
+1:
+
/* Restore rdx. */
movq (%rsp), %rdx

@@ -1309,12 +1335,12 @@ first_nmi:

/* Everything up to here is safe from nested NMIs */

-#ifdef CONFIG_DEBUG_ENTRY
+/*#ifdef CONFIG_DEBUG_ENTRY*/
/*
* For ease of testing, unmask NMIs right away. Disabled by
* default because IRET is very expensive.
*/
- pushq $0 /* SS */
+ pushq $NMI_RECURSIVE_SS /* SS */
pushq %rsp /* RSP (minus 8 because of the previous push) */
addq $8, (%rsp) /* Fix up RSP */
pushfq /* RFLAGS */
@@ -1322,7 +1348,7 @@ first_nmi:
pushq $1f /* RIP */
INTERRUPT_RETURN /* continues at repeat_nmi below */
1:
-#endif
+/*#endif*/

repeat_nmi:
/*
diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index 7d5a1929d76b..58473396f290 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -176,6 +176,9 @@
#define GDT_ENTRY_DEFAULT_USER_DS 5
#define GDT_ENTRY_DEFAULT_USER_CS 6

+#define GDT_ENTRY_FOO 7
+#define GDT_ENTRY_BAR 16
+
/* Needs two entries */
#define GDT_ENTRY_TSS 8
/* Needs two entries */
@@ -190,7 +193,7 @@
/*
* Number of entries in the GDT table:
*/
-#define GDT_ENTRIES 16
+#define GDT_ENTRIES 17

/*
* Segment selector values corresponding to the above entries:
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 079d83fc6488..91ddae732a36 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -108,6 +108,8 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
[GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
+ [GDT_ENTRY_FOO] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
+ [GDT_ENTRY_BAR] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
#else
[GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xc09a, 0, 0xfffff),
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc092, 0, 0xfffff),
--
2.5.0
Andy Lutomirski
2016-02-29 23:45:25 UTC
It no longer has any users.

Signed-off-by: Andy Lutomirski <***@kernel.org>
---
arch/x86/kernel/asm-offsets.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 84a7524b202c..5c042466f274 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -59,7 +59,6 @@ void common(void) {

#ifdef CONFIG_PARAVIRT
BLANK();
- OFFSET(PARAVIRT_enabled, pv_info, paravirt_enabled);
OFFSET(PARAVIRT_PATCH_pv_cpu_ops, paravirt_patch_template, pv_cpu_ops);
OFFSET(PARAVIRT_PATCH_pv_irq_ops, paravirt_patch_template, pv_irq_ops);
OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
--
2.5.0