Process switching: switch_mm_irqs_off

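This post looks at process switching in the Linux kernel, specifically the `switch_mm_irqs_off` function and its use of the `WRMSR` (Write to Model Specific Register) instruction. It covers how `switch_mm_irqs_off`, which the 4.4 series only gained in a later stable release, issues an Indirect Branch Prediction Barrier (IBPB) by writing a Model Specific Register (MSR) through the `wrmsr` path. It also discusses what can make that write fault, such as a reserved or unimplemented MSR address, and how kernel boot parameters control the related security features.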

 

 

[   74.474853] ACPI: Low-level resume complete
[   74.480221] PM: Restoring platform NVS memory
[   74.489112] Enabling non-boot CPUs ...
[   74.493513] x86: Booting SMP configuration:
[   74.498341] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   74.507285]  cache: parent cpu1 should not be sleeping
[   74.513420] CPU1 is up
[   74.516196] smpboot: Booting Node 0 Processor 2 APIC 0x2
[   74.525792]  cache: parent cpu2 should not be sleeping
[   74.532148] CPU2 is up
[   74.534929] smpboot: Booting Node 0 Processor 3 APIC 0x3
[   74.543873]  cache: parent cpu3 should not be sleeping
[   74.549929] CPU3 is up
[   74.552669] smpboot: Booting Node 0 Processor 4 APIC 0x4
[   74.562247]  cache: parent cpu4 should not be sleeping
[   74.568451] CPU4 is up
[   74.571326] smpboot: Booting Node 0 Processor 5 APIC 0x5
[   74.580175]  cache: parent cpu5 should not be sleeping
[   74.586215] CPU5 is up
[   74.588939] smpboot: Booting Node 0 Processor 6 APIC 0x6
[   74.598292]  cache: parent cpu6 should not be sleeping
[   74.604653] CPU6 is up
[   74.607406] smpboot: Booting Node 0 Processor 7 APIC 0x7
[   74.616139]  cache: parent cpu7 should not be sleeping
[   74.622206] CPU7 is up
[   74.624915] smpboot: Booting Node 0 Processor 8 APIC 0x8
[   74.634096]  cache: parent cpu8 should not be sleeping
[   74.640432] CPU8 is up
[   74.643389] smpboot: Booting Node 0 Processor 9 APIC 0x9
[   74.651888]  cache: parent cpu9 should not be sleeping
[   74.657977] CPU9 is up
[   74.660680] smpboot: Booting Node 0 Processor 10 APIC 0xa
[   74.669746]  cache: parent cpu10 should not be sleeping
[   74.676259] CPU10 is up
[   74.679161] smpboot: Booting Node 0 Processor 11 APIC 0xb
[   74.687636]  cache: parent cpu11 should not be sleeping
[   74.693824] CPU11 is up
[   74.696603] smpboot: Booting Node 0 Processor 12 APIC 0xc
[   74.702671] general protection fault: 0000 [#1] SMP NOPTI
[   74.708739] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G           O      4.19.0-desktop-amd64 #3100
[   74.718575] Hardware name: Suma W3330H0/22DB4, BIOS CWTQ051207 06/02/2021
[   74.726191] RIP: 0010:switch_mm_irqs_off+0x3ec/0x4f0
[   74.731763] Code: 7d 08 49 83 c5 18 4c 89 fa 31 f6 e8 ce 56 99 00 49 8b 45 00 48 85 c0 75 e5 e9 73 ff ff ff b9 49 00 00 00 b8 01 00 00 00 31 d2 <0f> 30 e9 86 fc ff ff e9 81 00 00 00 48 c7 c2 e0 19 02 00 31 c0 65
[   74.752755] RSP: 0018:ffffa98a4329be30 EFLAGS: 00010046
[   74.758616] RAX: 0000000000000001 RBX: ffff8e3a8be02200 RCX: 0000000000000049
[   74.766619] RDX: 0000000000000000 RSI: ffff8e3a8be02200 RDI: ffff8e3a8ba2d880
[   74.774612] RBP: ffffffffbd07ca60 R08: ffff8e3a9ee22ae0 R09: 0000000000000000
[   74.782605] R10: 0000000000000000 R11: 0000001113e35491 R12: 0000000000000008
[   74.790598] R13: ffff8e3a8ba2d880 R14: ffffffffbd07ca60 R15: ffff8e3a8be02200
[   74.798598] FS:  0000000000000000(0000) GS:ffff8e3a9ee00000(0000) knlGS:0000000000000000
[   74.807662] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   74.814103] CR2: 0000000000000000 CR3: 000000062da0a000 CR4: 00000000003406e0
[   74.822096] Call Trace:
[   74.824865]  __schedule+0x260/0x840
[   74.828794]  schedule_idle+0x1e/0x40
[   74.832812]  do_idle+0x165/0x250
[   74.836450]  cpu_startup_entry+0x6f/0x80
[   74.840854]  start_secondary+0x1a4/0x200
[   74.845264]  secondary_startup_64+0xa4/0xb0
[   74.849963] Modules linked in: dm_mod bnep fuse cfg80211 st sr_mod amd64_edac_mod joydev cdrom edac_mce_amd kvm_amd bluetooth drbg ansi_cprng ecdh_generic rfkill nls_ascii nls_cp437 vfat fat snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass sg pcspkr efi_pstore efivars k10temp ccp pcc_cpufreq evdev acpi_cpufreq mincores(O) i2c_dev vfs_monitor(O) binfmt_misc efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 btrfs xor zstd_compress raid6_pq libcrc32c crc32c_generic zstd_decompress xxhash amdgpu chash gpu_sched hid_generic usbhid hid sd_mod radeon i2c_algo_bit ahci ttm libahci drm_kms_helper libata r8169 realtek crc32c_intel scsi_mod drm libphy button
[   74.926832] ---[ end trace a4da77854ee63e9f ]---
[   74.932012] RIP: 0010:switch_mm_irqs_off+0x3ec/0x4f0
[   74.937582] Code: 7d 08 49 83 c5 18 4c 89 fa 31 f6 e8 ce 56 99 00 49 8b 45 00 48 85 c0 75 e5 e9 73 ff ff ff b9 49 00 00 00 b8 01 00 00 00 31 d2 <0f> 30 e9 86 fc ff ff e9 81 00 00 00 48 c7 c2 e0 19 02 00 31 c0 65
[   74.958575] RSP: 0018:ffffa98a4329be30 EFLAGS: 00010046
[   74.964434] RAX: 0000000000000001 RBX: ffff8e3a8be02200 RCX: 0000000000000049
[   74.972419] RDX: 0000000000000000 RSI: ffff8e3a8be02200 RDI: ffff8e3a8ba2d880
[   74.980412] RBP: ffffffffbd07ca60 R08: ffff8e3a9ee22ae0 R09: 0000000000000000
[   74.988405] R10: 0000000000000000 R11: 0000001113e35491 R12: 0000000000000008
[   74.996398] R13: ffff8e3a8ba2d880 R14: ffffffffbd07ca60 R15: ffff8e3a8be02200
[   75.004391] FS:  0000000000000000(0000) GS:ffff8e3a9ee00000(0000) knlGS:0000000000000000
[   75.013454] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.019893] CR2: 0000000000000000 CR3: 000000062da0a000 CR4: 00000000003406e0
[   75.027888] Kernel panic - not syncing: Attempted to kill the idle task!
[   76.307847] Shutting down cpus with NMI
[   76.312170] Kernel Offset: 0x3b000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   76.324234] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

 

$ cat obj.txt |grep switch_mm_irq
ffffffff8106d480 <switch_mm_irqs_off>:
ffffffff8106d4b5:       0f 84 1d 02 00 00       je     ffffffff8106d6d8 <switch_mm_irqs_off+0x258>
ffffffff8106d4c1:       74 43                   je     ffffffff8106d506 <switch_mm_irqs_off+0x86>
ffffffff8106d4cd:       74 37                   je     ffffffff8106d506 <switch_mm_irqs_off+0x86>
ffffffff8106d4cf:       e9 2d 00 00 00          jmpq   ffffffff8106d501 <switch_mm_irqs_off+0x81>
ffffffff8106d4ec:       74 0b                   je     ffffffff8106d4f9 <switch_mm_irqs_off+0x79>
ffffffff8106d4f3:       0f 85 67 03 00 00       jne    ffffffff8106d860 <switch_mm_irqs_off+0x3e0>
ffffffff8106d521:       0f 84 89 03 00 00       je     ffffffff8106d8b0 <switch_mm_irqs_off+0x430>
ffffffff8106d52e:       74 0f                   je     ffffffff8106d53f <switch_mm_irqs_off+0xbf>
ffffffff8106d546:       74 0c                   je     ffffffff8106d554 <switch_mm_irqs_off+0xd4>
ffffffff8106d569:       0f 85 04 03 00 00       jne    ffffffff8106d873 <switch_mm_irqs_off+0x3f3>
ffffffff8106d59b:       0f 84 0a 01 00 00       je     ffffffff8106d6ab <switch_mm_irqs_off+0x22b>
ffffffff8106d5a7:       75 d6                   jne    ffffffff8106d57f <switch_mm_irqs_off+0xff>
ffffffff8106d5be:       0f 87 1e 03 00 00       ja     ffffffff8106d8e2 <switch_mm_irqs_off+0x462>
ffffffff8106d5e0:       eb 1b                   jmp    ffffffff8106d5fd <switch_mm_irqs_off+0x17d>
ffffffff8106d645:       0f 84 f0 00 00 00       je     ffffffff8106d73b <switch_mm_irqs_off+0x2bb>
ffffffff8106d671:       74 0f                   je     ffffffff8106d682 <switch_mm_irqs_off+0x202>
ffffffff8106d69a:       0f 85 27 02 00 00       jne    ffffffff8106d8c7 <switch_mm_irqs_off+0x447>
ffffffff8106d6bb:       0f 87 1b ff ff ff       ja     ffffffff8106d5dc <switch_mm_irqs_off+0x15c>
ffffffff8106d6d3:       e9 4f ff ff ff          jmpq   ffffffff8106d627 <switch_mm_irqs_off+0x1a7>
ffffffff8106d6df:       74 bf                   je     ffffffff8106d6a0 <switch_mm_irqs_off+0x220>
ffffffff8106d6ec:       72 b2                   jb     ffffffff8106d6a0 <switch_mm_irqs_off+0x220>
ffffffff8106d713:       0f 84 ed fd ff ff       je     ffffffff8106d506 <switch_mm_irqs_off+0x86>
ffffffff8106d736:       e9 cb fd ff ff          jmpq   ffffffff8106d506 <switch_mm_irqs_off+0x86>
ffffffff8106d761:       0f 84 1b ff ff ff       je     ffffffff8106d682 <switch_mm_irqs_off+0x202>
ffffffff8106d767:       e9 07 ff ff ff          jmpq   ffffffff8106d673 <switch_mm_irqs_off+0x1f3>
ffffffff8106d77d:       0f 83 a4 fe ff ff       jae    ffffffff8106d627 <switch_mm_irqs_off+0x1a7>
ffffffff8106d78f:       0f 85 72 01 00 00       jne    ffffffff8106d907 <switch_mm_irqs_off+0x487>
ffffffff8106d7b3:       74 1e                   je     ffffffff8106d7d3 <switch_mm_irqs_off+0x353>
ffffffff8106d7d1:       75 e6                   jne    ffffffff8106d7b9 <switch_mm_irqs_off+0x339>
ffffffff8106d7e7:       e9 3b fe ff ff          jmpq   ffffffff8106d627 <switch_mm_irqs_off+0x1a7>
ffffffff8106d7fd:       0f 83 24 fe ff ff       jae    ffffffff8106d627 <switch_mm_irqs_off+0x1a7>
ffffffff8106d80f:       0f 85 eb 00 00 00       jne    ffffffff8106d900 <switch_mm_irqs_off+0x480>
ffffffff8106d833:       74 9e                   je     ffffffff8106d7d3 <switch_mm_irqs_off+0x353>
ffffffff8106d859:       75 e5                   jne    ffffffff8106d840 <switch_mm_irqs_off+0x3c0>
ffffffff8106d85b:       e9 73 ff ff ff          jmpq   ffffffff8106d7d3 <switch_mm_irqs_off+0x353>
ffffffff8106d86e:       e9 86 fc ff ff          jmpq   ffffffff8106d4f9 <switch_mm_irqs_off+0x79>
ffffffff8106d88c:       74 08                   je     ffffffff8106d896 <switch_mm_irqs_off+0x416>
ffffffff8106d8a1:       75 de                   jne    ffffffff8106d881 <switch_mm_irqs_off+0x401>
ffffffff8106d8ab:       e9 bf fc ff ff          jmpq   ffffffff8106d56f <switch_mm_irqs_off+0xef>
ffffffff8106d8c2:       e9 60 fc ff ff          jmpq   ffffffff8106d527 <switch_mm_irqs_off+0xa7>
ffffffff8106d8d1:       75 3b                   jne    ffffffff8106d90e <switch_mm_irqs_off+0x48e>
ffffffff8106d8dd:       e9 be fd ff ff          jmpq   ffffffff8106d6a0 <switch_mm_irqs_off+0x220>
ffffffff8106d8f4:       e9 f5 fc ff ff          jmpq   ffffffff8106d5ee <switch_mm_irqs_off+0x16e>
ffffffff8106d8fb:       e9 6f fc ff ff          jmpq   ffffffff8106d56f <switch_mm_irqs_off+0xef>
ffffffff8106d902:       e9 0e ff ff ff          jmpq   ffffffff8106d815 <switch_mm_irqs_off+0x395>
ffffffff8106d909:       e9 87 fe ff ff          jmpq   ffffffff8106d795 <switch_mm_irqs_off+0x315>
ffffffff8106d919:       77 34                   ja     ffffffff8106d94f <switch_mm_irqs_off+0x4cf>
ffffffff8106d938:       e9 63 fd ff ff          jmpq   ffffffff8106d6a0 <switch_mm_irqs_off+0x220>
ffffffff8106d94a:       e9 51 fd ff ff          jmpq   ffffffff8106d6a0 <switch_mm_irqs_off+0x220>
ffffffff8106d95d:       e9 3e fd ff ff          jmpq   ffffffff8106d6a0 <switch_mm_irqs_off+0x220>
ffffffff8106d987:       e8 f4 fa ff ff          callq  ffffffff8106d480 <switch_mm_irqs_off>
ffffffff8106db30:       e9 4b f9 ff ff          jmpq   ffffffff8106d480 <switch_mm_irqs_off>
ffffffff8176d59e:       e8 dd fe 8f ff          callq  ffffffff8106d480 <switch_mm_irqs_off>
 

 

$ addr2line ffffffff8106d480 -e vmlinux-4.19.0-6-amd64
./debian/build/build_amd64_none_amd64/./arch/x86/mm/tlb.c:274         (the actual code is at line 272)
$ addr2line ffffffff8106d86c -e vmlinux-4.19.0-6-amd64
./debian/build/build_amd64_none_amd64/./arch/x86/include/asm/nospec-branch.h:274
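The second address given to addr2line is not printed anywhere in the oops. It comes from adding the offset in `RIP: 0010:switch_mm_irqs_off+0x3ec/0x4f0` to the symbol address from the objdump listing. A minimal sketch of that arithmetic follows; the helper names are made up for illustration and are not kernel or binutils APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: link-time address of "symbol+offset" as shown in an oops. */
static uint64_t symbol_plus_offset(uint64_t symbol_addr, uint64_t offset)
{
	return symbol_addr + offset;
}

/*
 * Hypothetical helper: run-time address under KASLR, using the value from
 * the "Kernel Offset:" line of the panic output.
 */
static uint64_t runtime_address(uint64_t link_addr, uint64_t kaslr_offset)
{
	return link_addr + kaslr_offset;
}
```

With the values from this crash, `symbol_plus_offset(0xffffffff8106d480, 0x3ec)` gives `0xffffffff8106d86c`, which is the address looked up above. addr2line works on the link-time addresses in vmlinux, so the KASLR offset (0x3b000000 here) only matters when matching raw run-time addresses.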

 


static __always_inline
void alternative_msr_write(unsigned int msr, u64 val, unsigned int feature)
{
	asm volatile(ALTERNATIVE("", "wrmsr", %c[feature])
		: : "c" (msr),
		    "a" ((u32)val),
		    "d" ((u32)(val >> 32)),
		    [feature] "i" (feature)
		: "memory");
}

WRMSR — Write to Model Specific Register

Opcode   Instruction   Op/En   64-Bit Mode   Compat/Leg Mode   Description
0F 30    WRMSR         ZO      Valid         Valid             Write the value in EDX:EAX to MSR specified by ECX.

Instruction Operand Encoding

Op/En   Operand 1   Operand 2   Operand 3   Operand 4
ZO      NA          NA          NA          NA

  Writes the contents of registers EDX:EAX into the 64-bit model specific register (MSR) specified in the ECX register. (On processors that support the Intel 64 architecture, the high-order 32 bits of RCX are ignored.) The contents of the EDX register are copied to high-order 32 bits of the selected MSR and the contents of the EAX register are copied to low-order 32 bits of the MSR. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are ignored.) Undefined or reserved bits in an MSR should be set to values previously read.

[   74.758616] RAX: 0000000000000001 RBX: ffff8e3a8be02200 RCX: 0000000000000049
[   74.766619] RDX: 0000000000000000 RSI: ffff8e3a8be02200 RDI: ffff8e3a8ba2d880
[   74.774612] RBP: ffffffffbd07ca60 R08: ffff8e3a9ee22ae0 R09: 0000000000000000

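In other words, the value in EDX:EAX (0x00000000:0x00000001) is written to MSR 0x49. EDX, which holds 0, is copied into the high 32 bits of MSR 0x49, and EAX, which holds 1, is copied into the low 32 bits. So writing 1 to MSR 0x49 is what triggered the exception.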
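The same operands can be read straight out of the oops `Code:` line, where the bytes just before the faulting `<0f> 30` load the registers. The sketch below decodes only that exact byte pattern and is not a general x86 disassembler; the names `wrmsr_args`, `decode_wrmsr_setup` and `msr_value` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct wrmsr_args {
	uint32_t ecx;	/* MSR index */
	uint32_t eax;	/* low 32 bits of the value */
	uint32_t edx;	/* high 32 bits of the value */
};

/* Read a little-endian 32-bit immediate. */
static uint32_t le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Illustrative decoder for exactly this sequence:
 *   b9 imm32   mov  $imm32, %ecx
 *   b8 imm32   mov  $imm32, %eax
 *   31 d2      xor  %edx, %edx
 *   0f 30      wrmsr
 * Returns 0 on success, -1 if the bytes do not match.
 */
static int decode_wrmsr_setup(const uint8_t *code, size_t len,
			      struct wrmsr_args *out)
{
	if (len < 14 || code[0] != 0xb9 || code[5] != 0xb8 ||
	    code[10] != 0x31 || code[11] != 0xd2 ||
	    code[12] != 0x0f || code[13] != 0x30)
		return -1;
	out->ecx = le32(code + 1);
	out->eax = le32(code + 6);
	out->edx = 0;	/* xor %edx, %edx */
	return 0;
}

/* The 64-bit value that WRMSR writes is EDX:EAX. */
static uint64_t msr_value(const struct wrmsr_args *a)
{
	return ((uint64_t)a->edx << 32) | a->eax;
}

/* Tail of the oops "Code:" line, ending at the faulting <0f> 30. */
static const uint8_t oops_code_tail[14] = {
	0xb9, 0x49, 0x00, 0x00, 0x00,
	0xb8, 0x01, 0x00, 0x00, 0x00,
	0x31, 0xd2,
	0x0f, 0x30,
};
```

Decoding `oops_code_tail` yields ECX = 0x49, EAX = 1 and EDX = 0, which agrees with the register dump.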

The description above is from the Intel manual. What does the AMD manual say?

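If CPUID reports IBPB_SUPPORT or SPEC_CTRL, the IBPB function can be controlled through the PRED_CMD MSR.

When ibpb_enabled is set to 1, an IBPB barrier is issued on context switches into guest mode or user mode. It flushes the contents of the indirect branch predictor, which stops other virtual machines or other processes on the same host from attacking through it.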
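The CPUID condition above can be written as a small predicate. The bit positions used here are my reading of the vendor documentation and the kernel's cpufeatures.h (Intel: CPUID leaf 7, subleaf 0, EDX bit 26; AMD: CPUID leaf 0x80000008, EBX bit 12) and should be checked against the manuals; the function name and macros are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed: Intel CPUID.(EAX=7,ECX=0):EDX bit 26 enumerates IBRS and IBPB (SPEC_CTRL). */
#define CPUID7_EDX_SPEC_CTRL	(1u << 26)
/* Assumed: AMD CPUID Fn8000_0008 EBX bit 12 enumerates IBPB. */
#define CPUID80000008_EBX_IBPB	(1u << 12)

/*
 * Hypothetical helper: true if either vendor-specific bit says the
 * PRED_CMD MSR (0x49) accepts an IBPB command.
 */
static bool cpu_reports_ibpb(uint32_t leaf7_edx, uint32_t leaf80000008_ebx)
{
	return (leaf7_edx & CPUID7_EDX_SPEC_CTRL) ||
	       (leaf80000008_ebx & CPUID80000008_EBX_IBPB);
}
```

On real hardware the two register values could be obtained with `__get_cpuid_count()` and `__get_cpuid()` from the GCC/Clang `<cpuid.h>` header. The predicate takes plain integers so that it can be checked without running on an x86 machine.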

Causes of the #GP

1) This instruction must be executed at privilege level 0 or in real-address mode; otherwise, a general protection exception #GP(0) is generated.

2) Specifying a reserved or unimplemented MSR address in ECX will also cause a general protection exception.

3) The processor will also generate a general protection exception if software attempts to write to bits in a reserved MSR.

The faulting code runs in the kernel's idle task (CS: 0010, that is, privilege level 0), so case 1 does not apply here. That leaves cases 2 and 3.

Follow-up items:

1) The meaning of MSR 0x49 in the AMD manual.

2) The path from switch_mm_irqs_off to wrmsr, and which conditions must hold before wrmsr is reached.

switch_mm_irqs_off in the 5.12 kernel

void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
			struct task_struct *tsk)
{
	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
	bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
	unsigned cpu = smp_processor_id();
	u64 next_tlb_gen;
	bool need_flush;
	u16 new_asid;

	/*
	 * NB: The scheduler will call us with prev == next when switching
	 * from lazy TLB mode to normal mode if active_mm isn't changing.
	 * When this happens, we don't assume that CR3 (and hence
	 * cpu_tlbstate.loaded_mm) matches next.
	 *
	 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
	 */

	/* We don't want flush_tlb_func_* to run concurrently with us. */
	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
		WARN_ON_ONCE(!irqs_disabled());

	/*
	 * Verify that CR3 is what we think it is.  This will catch
	 * hypothetical buggy code that directly switches to swapper_pg_dir
	 * without going through leave_mm() / switch_mm_irqs_off() or that
	 * does something like write_cr3(read_cr3_pa()).
	 *
	 * Only do this check if CONFIG_DEBUG_VM=y because __read_cr3()
	 * isn't free.
	 */
#ifdef CONFIG_DEBUG_VM
	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
		/*
		 * If we were to BUG here, we'd be very likely to kill
		 * the system so hard that we don't see the call trace.
		 * Try to recover instead by ignoring the error and doing
		 * a global flush to minimize the chance of corruption.
		 *
		 * (This is far from being a fully correct recovery.
		 *  Architecturally, the CPU could prefetch something
		 *  back into an incorrect ASID slot and leave it there
		 *  to cause trouble down the road.  It's better than
		 *  nothing, though.)
		 */
		__flush_tlb_all();
	}
#endif
	this_cpu_write(cpu_tlbstate.is_lazy, false);

	/*
	 * The membarrier system call requires a full memory barrier and
	 * core serialization before returning to user-space, after
	 * storing to rq->curr, when changing mm.  This is because
	 * membarrier() sends IPIs to all CPUs that are in the target mm
	 * to make them issue memory barriers.  However, if another CPU
	 * switches to/from the target mm concurrently with
	 * membarrier(), it can cause that CPU not to receive an IPI
	 * when it really should issue a memory barrier.  Writing to CR3
	 * provides that full memory barrier and core serializing
	 * instruction.
	 */
	if (real_prev == next) {
		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
			   next->context.ctx_id);

		/*
		 * Even in lazy TLB mode, the CPU should stay set in the
		 * mm_cpumask. The TLB shootdown code can figure out from
		 * from cpu_tlbstate.is_lazy whether or not to send an IPI.
		 */
		if (WARN_ON_ONCE(real_prev != &init_mm &&
				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
			cpumask_set_cpu(cpu, mm_cpumask(next));

		/*
		 * If the CPU is not in lazy TLB mode, we are just switching
		 * from one thread in a process to another thread in the same
		 * process. No TLB flush required.
		 */
		if (!was_lazy)
			return;

		/*
		 * Read the tlb_gen to check whether a flush is needed.
		 * If the TLB is up to date, just use it.
		 * The barrier synchronizes with the tlb_gen increment in
		 * the TLB shootdown code.
		 */
		smp_mb();
		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
				next_tlb_gen)
			return;

		/*
		 * TLB contents went out of date while we were in lazy
		 * mode. Fall through to the TLB switching code below.
		 */
		new_asid = prev_asid;
		need_flush = true;
	} else {
		/*
		 * Avoid user/user BTB poisoning by flushing the branch
		 * predictor when switching between processes. This stops
		 * one process from doing Spectre-v2 attacks on another.
		 */
		cond_ibpb(tsk); // this call is what eventually reaches wrmsr

		/*
		 * Stop remote flushes for the previous mm.
		 * Skip kernel threads; we never send init_mm TLB flushing IPIs,
		 * but the bitmap manipulation can cause cache line contention.
		 */
		if (real_prev != &init_mm) {
			VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu,
						mm_cpumask(real_prev)));
			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
		}

		/*
		 * Start remote flushes and then read tlb_gen.
		 */
		if (next != &init_mm)
			cpumask_set_cpu(cpu, mm_cpumask(next));
		next_tlb_gen = atomic64_read(&next->context.tlb_gen);

		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);

		/* Let nmi_uaccess_okay() know that we're changing CR3. */
		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
		barrier();
	}

	/* ... (the rest of the function is omitted in this excerpt) ... */
}

// The next-level function, also in tlb.c
static void cond_ibpb(struct task_struct *next)
{
	if (!next || !next->mm)
		return;

	/*
	 * Both, the conditional and the always IBPB mode use the mm
	 * pointer to avoid the IBPB when switching between tasks of the
	 * same process. Using the mm pointer instead of mm->context.ctx_id
	 * opens a hypothetical hole vs. mm_struct reuse, which is more or
	 * less impossible to control by an attacker. Aside of that it
	 * would only affect the first schedule so the theoretically
	 * exposed data is not really interesting.
	 */
	if (static_branch_likely(&switch_mm_cond_ibpb)) { // switch_mm_cond_ibpb is a key switch: if it is not enabled, the wrmsr below is never reached
		unsigned long prev_mm, next_mm;

		/*
		 * This is a bit more complex than the always mode because
		 * it has to handle two cases:
		 *
		 * 1) Switch from a user space task (potential attacker)
		 *    which has TIF_SPEC_IB set to a user space task
		 *    (potential victim) which has TIF_SPEC_IB not set.
		 *
		 * 2) Switch from a user space task (potential attacker)
		 *    which has TIF_SPEC_IB not set to a user space task
		 *    (potential victim) which has TIF_SPEC_IB set.
		 *
		 * This could be done by unconditionally issuing IBPB when
		 * a task which has TIF_SPEC_IB set is either scheduled in
		 * or out. Though that results in two flushes when:
		 *
		 * - the same user space task is scheduled out and later
		 *   scheduled in again and only a kernel thread ran in
		 *   between.
		 *
		 * - a user space task belonging to the same process is
		 *   scheduled in after a kernel thread ran in between
		 *
		 * - a user space task belonging to the same process is
		 *   scheduled in immediately.
		 *
		 * Optimize this with reasonably small overhead for the
		 * above cases. Mangle the TIF_SPEC_IB bit into the mm
		 * pointer of the incoming task which is stored in
		 * cpu_tlbstate.last_user_mm_ibpb for comparison.
		 */
		next_mm = mm_mangle_tif_spec_ib(next);
		prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);

		/*
		 * Issue IBPB only if the mm's are different and one or
		 * both have the IBPB bit set.
		 */
		if (next_mm != prev_mm &&
		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
			indirect_branch_prediction_barrier(); // this is the wrmsr

		this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
	}

	if (static_branch_unlikely(&switch_mm_always_ibpb)) { // if switch_mm_always_ibpb is not enabled, the wrmsr below is never reached either
		/*
		 * Only flush when switching to a user space task with a
		 * different context than the user space task which ran
		 * last on this CPU.
		 */
		if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
			indirect_branch_prediction_barrier(); // this is the wrmsr
			this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
		}
	}
}
// The actual implementation is in arch/x86/include/asm/nospec-branch.h
static __always_inline
void alternative_msr_write(unsigned int msr, u64 val, unsigned int feature)
{
	asm volatile(ALTERNATIVE("", "wrmsr", %c[feature])
		: : "c" (msr),
		    "a" ((u32)val),
		    "d" ((u32)(val >> 32)),
		    [feature] "i" (feature)
		: "memory");
}

static inline void indirect_branch_prediction_barrier(void)
{
	u64 val = PRED_CMD_IBPB;

	alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB);
}

./x86/include/asm/msr-index.h:#define PRED_CMD_IBPB                     BIT(0)
./x86/include/asm/msr-index.h:#define MSR_IA32_PRED_CMD         0x00000049 /* Prediction Command */
./x86/include/asm/cpufeatures.h:#define X86_FEATURE_USE_IBPB            ( 7*32+21) /* "" Indirect Branch Prediction Barrier enabled */

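This code matches the panic output: RCX = 0x49 is MSR_IA32_PRED_CMD and RAX = 1 is PRED_CMD_IBPB.

The key items in the code above are switch_mm_always_ibpb and switch_mm_cond_ibpb. If neither is enabled, the wrmsr instruction is never executed.

Both are set in linux-5.12.2/arch/x86/kernel/cpu/bugs.c. Reading through that code shows that they can be controlled with kernel cmdline parameters.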
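The two checks in cond_ibpb() can be modeled in user space to see when a barrier would be issued. This is a simplified sketch of the logic quoted above, not kernel code: mm pointers are replaced by plain integers, and `mangle`, `cond_mode_needs_ibpb` and `always_mode_needs_ibpb` are names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LAST_USER_MM_IBPB 0x1UL	/* low bit of the mangled mm pointer */

/* Model of mm_mangle_tif_spec_ib(): fold the task's TIF_SPEC_IB flag into bit 0. */
static uintptr_t mangle(uintptr_t mm, bool tif_spec_ib)
{
	return mm | (tif_spec_ib ? LAST_USER_MM_IBPB : 0);
}

/*
 * Conditional mode (switch_mm_cond_ibpb): barrier only if the mangled
 * values differ and at least one side has the IBPB bit set.
 */
static bool cond_mode_needs_ibpb(uintptr_t prev_mm, uintptr_t next_mm)
{
	return next_mm != prev_mm &&
	       ((next_mm | prev_mm) & LAST_USER_MM_IBPB);
}

/*
 * Always mode (switch_mm_always_ibpb): barrier whenever the incoming
 * user mm differs from the last user mm seen on this CPU.
 */
static bool always_mode_needs_ibpb(uintptr_t last_user_mm, uintptr_t next_mm)
{
	return last_user_mm != next_mm;
}
```

For example, switching between two different processes where neither has TIF_SPEC_IB set issues no barrier in conditional mode, but always mode issues one.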

static void __init
spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd)
{
	enum spectre_v2_user_mitigation mode = SPECTRE_V2_USER_NONE;
	bool smt_possible = IS_ENABLED(CONFIG_SMP);
	enum spectre_v2_user_cmd cmd;

	if (!boot_cpu_has(X86_FEATURE_IBPB) && !boot_cpu_has(X86_FEATURE_STIBP))
		return;

	if (cpu_smt_control == CPU_SMT_FORCE_DISABLED ||
	    cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
		smt_possible = false;

	cmd = spectre_v2_parse_user_cmdline(v2_cmd);
	........

	/* Initialize Indirect Branch Prediction Barrier */
	if (boot_cpu_has(X86_FEATURE_IBPB)) { // first check whether the CPU has the IBPB feature
		setup_force_cpu_cap(X86_FEATURE_USE_IBPB);

		spectre_v2_user_ibpb = mode;
		switch (cmd) {
		case SPECTRE_V2_USER_CMD_FORCE:
		case SPECTRE_V2_USER_CMD_PRCTL_IBPB:
		case SPECTRE_V2_USER_CMD_SECCOMP_IBPB:
			static_branch_enable(&switch_mm_always_ibpb); // enable the static key that matches the command
			spectre_v2_user_ibpb = SPECTRE_V2_USER_STRICT;
			break;
		case SPECTRE_V2_USER_CMD_PRCTL:
		case SPECTRE_V2_USER_CMD_AUTO:
		case SPECTRE_V2_USER_CMD_SECCOMP:
			static_branch_enable(&switch_mm_cond_ibpb); // enable the static key that matches the command
			break;
		default:
			break; // in the default case neither key is enabled
		}

		pr_info("mitigation: Enabling %s Indirect Branch Prediction Barrier\n",
			static_key_enabled(&switch_mm_always_ibpb) ?
			"always-on" : "conditional"); // this message shows up in dmesg
	}
}
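The switch statement above boils down to a mapping from the parsed command to one of the two static keys. The sketch below models that mapping for the `spectre_v2_user=` option strings as I recall them from the kernel's Spectre admin guide; the strings, the enum and the function name are assumptions for illustration, and the model presumes the CPU has X86_FEATURE_IBPB.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of which static key ends up enabled. */
enum ibpb_key { IBPB_KEY_NONE, IBPB_KEY_COND, IBPB_KEY_ALWAYS };

/*
 * Map a spectre_v2_user= option string (assumed spelling) to the key
 * that the switch in spectre_v2_user_select_mitigation() would enable.
 */
static enum ibpb_key ibpb_key_for_option(const char *opt)
{
	/* FORCE, PRCTL_IBPB, SECCOMP_IBPB -> switch_mm_always_ibpb */
	if (!strcmp(opt, "on") || !strcmp(opt, "prctl,ibpb") ||
	    !strcmp(opt, "seccomp,ibpb"))
		return IBPB_KEY_ALWAYS;

	/* PRCTL, AUTO, SECCOMP -> switch_mm_cond_ibpb */
	if (!strcmp(opt, "auto") || !strcmp(opt, "prctl") ||
	    !strcmp(opt, "seccomp"))
		return IBPB_KEY_COND;

	/* "off", or anything this sketch does not know: neither key */
	return IBPB_KEY_NONE;
}
```

As far as I can tell, disabling Spectre v2 mitigation as a whole (for example with `nospectre_v2`) also ends in the "neither key" case, so cond_ibpb() never reaches the wrmsr.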

 

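In fact, switch_mm_irqs_off only entered the 4.4 series in some later stable release. It can be found at https://elixir.bootlin.com/linux/v4.4.271/A/ident/switch_mm_irqs_off but a search in v4.4 itself finds nothing.

So why was it introduced?

1. Introduction of switch_mm_irqs_off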

x86/mm, sched/core: Turn off IRQs in switch_mm()

Authored by amluto and committed by Ingo Molnar on 28 Apr 2016. This is the commit that brought the function into Linux.

https://github.com/torvalds/linux/commit/078194f8e9fe3cf54c8fd8bded48a1db5bd8eb8a#

https://lore.kernel.org/lkml/f19baf759693c9dcae64bbff76189db77cb13398.1461688545.git.luto@kernel.org/

2. Refactoring of switch_mm_irqs_off

https://github.com/torvalds/linux/commit/12c4d978fd170ccdd7260ec11f93b11e46904228#diff-4e0003dcf59b07d578812fdb1612c0bce74abb2aac2d64d34c7c31c84fca6e5b

3. Introduction of IBPB

https://github.com/torvalds/linux/commit/4c71a2b6fd7e42814aa68a6dec88abf3b42ea573#diff-4e0003dcf59b07d578812fdb1612c0bce74abb2aac2d64d34c7c31c84fca6e5b

https://lore.kernel.org/lkml/20181125185005.466447057@linutronix.de/  (interface that reports whether the CPU uses IBPB, 2018-11-25)

https://lore.kernel.org/lkml/1517263487-3708-1-git-send-email-dwmw@amazon.co.uk/   29 Jan 2018 22:04:47 +0000

 
 static char *ibpb_state(void)
 {
-	if (boot_cpu_has(X86_FEATURE_USE_IBPB))
-		return ", IBPB";
-	else
-		return "";
+	if (boot_cpu_has(X86_FEATURE_IBPB)) {
+		switch (spectre_v2_user) {
+		case SPECTRE_V2_USER_NONE:
+			return ", IBPB: disabled";
+		case SPECTRE_V2_USER_STRICT:
+			return ", IBPB: always-on";
+		}
+	}
+	return "";
 }
 

Checking whether spectre_v2 mitigation is enabled

Check with:  grep . /sys/devices/system/cpu/vulnerabilities/*

Add the boot parameter in /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="splash quiet nospectre_v2"

 

After regenerating the GRUB configuration and rebooting, check again:

 grep . /sys/devices/system/cpu/vulnerabilities/*
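The spectre_v2 entry printed by that grep includes an `IBPB: ...` field, produced by the ibpb_state() function shown in the diff above. A minimal sketch for pulling that field out of one line follows; the function name is hypothetical, and it parses a string rather than opening the sysfs file so that it stays easy to check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the text after "IBPB: " (up to the next comma or end of line)
 * into out. Returns 0 on success, -1 if the line has no IBPB field or
 * out is too small.
 */
static int ibpb_state_from_line(const char *line, char *out, size_t outlen)
{
	static const char key[] = "IBPB: ";
	const char *p = strstr(line, key);
	size_t n;

	if (!p)
		return -1;
	p += sizeof(key) - 1;		/* skip past "IBPB: " */
	n = strcspn(p, ",\n");		/* field ends at ',' or newline */
	if (n + 1 > outlen)
		return -1;
	memcpy(out, p, n);
	out[n] = '\0';
	return 0;
}
```

Given an illustrative line such as `Vulnerable, IBPB: disabled, STIBP: disabled`, the function returns `disabled`. With the IBPB field reading disabled, the conditional and always-on paths in cond_ibpb() should both be off.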

 

https://software.intel.com/content/www/us/en/develop/articles/software-security-guidance/technical-documentation/indirect-branch-predictor-barrier.html

Some background on switch_mm_irqs_off: https://lwn.net/Articles/763058/

Kernel controls for the Spectre patches: https://www.zdnet.com/article/linux-kernel-gets-another-option-to-disable-spectre-mitigations/

Kernel documentation for the Spectre cmdline parameters: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/spectre.html

The wrmsr instruction, described in detail including its exceptions: https://www.felixcloutier.com/x86/wrmsr

Intel manual download: https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html

AMD manual download: https://developer.amd.com/resources/developer-guides-manuals/

The wrmsr instruction: https://www.xuebuyuan.com/804478.html

A similar problem, although it crashes in a different place: https://lkml.org/lkml/2019/1/3/540

Disassembly and the addr2line command: https://www.jianshu.com/p/db13bddf4bc0

Disabling the kernel patches through the cmdline: https://yux.im/posts/technology/security/disable-meltdown-and-spectre-patches-on-linux/
