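A simplified (and not strictly accurate) picture of a hypervisor: once the hypervisor is enabled there is a guest and a host. Certain CPU operations of the guest go through a structure called the VMCS (which occupies one page), are handled by the host, and are then handed back to the guest. If you have used VMware, the host is the computer you are sitting at and the guest is whatever runs inside the virtual machine you started. Material on this topic is scarce in China, so I am keeping my notes here.
Virtualization comes in two flavors, Intel VT and AMD SVM. Space is limited, so this article covers only VT; SVM will be covered in a later article.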
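Since this is the first article in the series, our final goal is:
to build a hypervisor, hook SSDT functions, and defeat the common hypervisor detections.
(Note: hyperduck in this article is the name of my project; do not confuse it with hypervisor.)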
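If you want to truly master building a hypervisor while reading this article, you need the following skills:
1. C and C++
2. Basic kernel knowledge
3. The concepts of RIP and RSP (ESP on 32-bit)
4. Enthusiasm for learning
Enough talk, let's begin.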
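As mentioned, SVM and VT have different architectures, and AMD and Intel CPUs enter virtualization in different ways. So the first step is to identify the CPU vendor with the cpuid instruction: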
int get_cpu_type() {
    _cpuid data = { 0 };
    char vendor[0x20] = { 0 };
    __cpuid((int*)&data, 0);
    *(int*)(vendor) = data.Rbx;
    *(int*)(vendor + 4) = data.Rdx;
    *(int*)(vendor + 8) = data.Rcx;
    if (memcmp(vendor, "GenuineIntel", 12) == 0) {
        global::cpu_type = _cpu_intel;
        return _cpu_intel;
    }
    if (memcmp(vendor, "AuthenticAMD", 12) == 0) {
        global::cpu_type = _cpu_amd;
        return _cpu_amd;
    }
    DebugPrint("[DebugMessage] Unknown CPU Detected! %s \n", vendor);
    global::cpu_type = _cpu_unk;
    return _cpu_unk;
}
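To make the register order above easier to see, here is a small user-mode sketch. The helper name vendor_from_regs is my own invention for illustration (it is not part of hyperduck), and it assumes a little-endian machine, which every x86 CPU is. It rebuilds the 12-byte vendor string from the EBX, EDX, ECX values that CPUID leaf 0 returns.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: rebuild the vendor string from CPUID leaf 0 output.
// The register order is EBX, EDX, ECX (not EBX, ECX, EDX).
std::string vendor_from_regs(std::uint32_t ebx, std::uint32_t edx, std::uint32_t ecx) {
    char vendor[12];
    std::memcpy(vendor + 0, &ebx, 4);  // characters 0..3
    std::memcpy(vendor + 4, &edx, 4);  // characters 4..7
    std::memcpy(vendor + 8, &ecx, 4);  // characters 8..11
    return std::string(vendor, sizeof(vendor));
}
```

Feeding it the register values an Intel CPU reports (0x756e6547, 0x49656e69, 0x6c65746e) gives back "GenuineIntel".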
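Next we enable VMX operation. This is really just reading the IA32_FEATURE_CONTROL MSR, setting the enable-VMXON bit, and writing it back. Note that VMXON is sometimes locked off by the firmware; in that case you have to enable VT yourself in the BIOS setup.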
bool enable_vmx_operation() {
    _cpuid data = { 0 };
    __cpuid((int*)&data, 1);
    if ((data.Rcx & (1 << 5)) == 0)
        return false;
    IA32_FEATURE_CONTROL_MSR Control = { 0 };
    Control.All = __readmsr(ia32_feature_control);
    // BIOS lock check
    if (Control.Fields.Lock == 0) {
        Control.Fields.Lock = true;
        Control.Fields.EnableVmxon = true;
        _huoji_writemsr(ia32_feature_control, Control.All);
    } else if (Control.Fields.EnableVmxon == false) {
        DebugPrint("[%s]: VMX locked off in BIOS\n", __FUNCTION__);
        return false;
    }
    return true;
}
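The three possible outcomes of that check can be written as a pure function over the MSR value. This is only a sketch: the names plan_feature_control and vmx_enable_action are made up for illustration, and I am assuming that EnableVmxon corresponds to bit 2 of IA32_FEATURE_CONTROL (enable VMXON outside SMX) and Lock to bit 0.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kLockBit       = 1ull << 0;  // IA32_FEATURE_CONTROL lock bit
constexpr std::uint64_t kVmxOutsideSmx = 1ull << 2;  // enable VMXON outside SMX

// Hypothetical names, for illustration only.
enum class vmx_enable_action { write_and_lock, already_enabled, locked_off };

struct vmx_enable_plan {
    vmx_enable_action action;
    std::uint64_t new_value;  // value to write back (unchanged unless we write)
};

vmx_enable_plan plan_feature_control(std::uint64_t msr) {
    if ((msr & kLockBit) == 0)        // not locked yet: we may set the bits ourselves
        return { vmx_enable_action::write_and_lock, msr | kLockBit | kVmxOutsideSmx };
    if ((msr & kVmxOutsideSmx) == 0)  // locked with VMX disabled: only the BIOS can fix it
        return { vmx_enable_action::locked_off, msr };
    return { vmx_enable_action::already_enabled, msr };
}
```

Only the first case involves a write to the MSR; once the lock bit is set the register stays read-only until the next reset.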
Allocate the VM context and the VM stack (we will use them later inside the hypervisor):
void vt::allocate_vmm_context() {
    PHYSICAL_ADDRESS phys = { 0 };
    phys.QuadPart = ~0ULL;
    //global::vm_context = (_vmm_context*)ExAllocatePoolWithTag(NonPagedPool, sizeof(_vmm_context), HUOJI_TAG);
    global::vm_context = (_vmm_context*)MmAllocateContiguousMemory(sizeof(_vmm_context), phys);
    // Both allocations can fail; bail out instead of dereferencing NULL
    if (global::vm_context == NULL) {
        DebugPrint("failed to allocate vmm_context\n");
        return;
    }
    RtlSecureZeroMemory(global::vm_context, sizeof(_vmm_context));
    global::vm_context->processor_count = KeQueryActiveProcessorCountEx(ALL_PROCESSOR_GROUPS);
    global::vm_context->vt_vcpu_table = (_vcpu_t**)ExAllocatePoolWithTag(NonPagedPool, sizeof(struct _vcpu_t*) * global::vm_context->processor_count, HUOJI_TAG);
    if (global::vm_context->vt_vcpu_table == NULL) {
        DebugPrint("failed to allocate vcpu_table\n");
        MmFreeContiguousMemory(global::vm_context);
        global::vm_context = NULL;
        return;
    }
    global::vm_context->kennel_base = global::kernel_base;
    global::vm_context->kennel_size = global::kernel_size;
    DebugPrint("vmm_context allocated at %p\n", global::vm_context);
    DebugPrint("vcpu_table allocated at %p\n", global::vm_context->vt_vcpu_table);
}
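Note that we allocate vm_context with MmAllocateContiguousMemory. When we enter the VM, the IRQL is at or above DISPATCH_LEVEL, where a page fault cannot be serviced, so this memory must never be paged out. MmAllocateContiguousMemory gives us nonpaged, physically contiguous memory. (Touching pageable memory at that IRQL raises a fault and crashes the system; if that is unclear, revisit the operating systems course from university.)
Allocate the stack. This is the stack our hypervisor (the host side) runs on; look up the concept if it is unfamiliar: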
vcpu->stack = MmAllocateContiguousMemory(KERNEL_STACK_SIZE, phys);
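CR0 contains various flags that modify basic processor operation. One flag we will run into is the protection enable (PE) bit, which determines whether the processor executes in real mode or protected mode.
The VMXE bit of CR4 (VMX enable) determines whether we can start VMX operation.
Keep in mind that there is also CR3; we don't need it yet, but we will later.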
void vt::enable_vmx() {
    uintptr_t cr0 = _huoji_readcr0();
    uintptr_t cr4 = _huoji_readcr4();
    cr0 |= __readmsr(ia32_vmx_cr0_fixed0);
    cr0 &= __readmsr(ia32_vmx_cr0_fixed1);
    cr4 |= __readmsr(ia32_vmx_cr4_fixed0);
    cr4 &= __readmsr(ia32_vmx_cr4_fixed1);
    _huoji_writecr0(cr0);
    _huoji_writecr4(cr4);
}
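The OR-then-AND pattern deserves a word: a bit that is 1 in the FIXED0 MSR must be 1 in the control register, and a bit that is 0 in the FIXED1 MSR must be 0. Here is a minimal user-mode sketch of the same arithmetic; the function names are mine, and the values used to try it are made up rather than read from a real CPU.

```cpp
#include <cassert>
#include <cstdint>

// Force a control-register value to satisfy the VMX fixed-bit constraints:
// bits set in fixed0 are forced to 1, bits clear in fixed1 are forced to 0.
constexpr std::uint64_t apply_fixed_bits(std::uint64_t value,
                                         std::uint64_t fixed0,
                                         std::uint64_t fixed1) {
    return (value | fixed0) & fixed1;
}

// True when the value already satisfies the constraints.
constexpr bool satisfies_fixed_bits(std::uint64_t value,
                                    std::uint64_t fixed0,
                                    std::uint64_t fixed1) {
    return apply_fixed_bits(value, fixed0, fixed1) == value;
}
```

For CR4, a FIXED0 value of 0x2000 is what forces the VMXE bit on.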
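You can see that I go through my own wrapper functions here. I compile with LLVM, and LLVM does not provide these intrinsics (apparently its authors were lazy), so I substitute my own functions, like the ones below. Nothing to be alarmed about.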
static unsigned __int64 _huoji_readcr0(void) {
#ifdef _llvm
    unsigned __int64 result_data = 0;
    __asm("mov %%cr0, %0" : "=r"(result_data) : : "memory");
    return result_data;
#else
    return __readcr0();
#endif
}
static unsigned __int64 _huoji_readcr4(void) {
#ifdef _llvm
    unsigned __int64 result_data = 0;
    __asm("mov %%cr4, %0" : "=r"(result_data) : : "memory");
    return result_data;
#else
    return __readcr4();
#endif
}
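Before entering the VM there is one very important concept to understand: CPUs today are multi-core, not single-core, so we need every core to run our code at the same time. For that we use a DPC through KeGenericCallDpc: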
void init_vm_dpc_callback(PRKDPC Dpc, PVOID Context, PVOID SystemArgument1, PVOID SystemArgument2) {
    uintptr_t processor_number = KeGetCurrentProcessorNumber();
    _vcpu_t* vcpu = global::vm_context->vt_vcpu_table[processor_number];
    RtlCaptureContext(&vcpu->context_frame);
    if (vcpu->vm_status == 0) {
        vcpu->vm_status = 1;
        vt::init_logical_processor();
    } else if (vcpu->vm_status == 1) {
        vcpu->vm_status = 2;
        DebugPrint("[%d] vm finished! \n", processor_number);
        vm_restore_context(&vcpu->context_frame);
        DebugPrint("[%d] vm finished restore contex finished! \n", processor_number);
    }
    KeSignalCallDpcSynchronize(SystemArgument2);
    KeSignalCallDpcDone(SystemArgument1);
}
.....
KeGenericCallDpc(init_vm_dpc_callback, NULL);
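This way every core runs our init_vm_dpc_callback function at the same time.
Inside the function I save the context with RtlCaptureContext, so that when we enter the VM the guest RIP resumes right after that call.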
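The "code after the capture runs twice" idea is easier to see in user mode. Below is an analogy, not the real mechanism: I use setjmp/longjmp to stand in for RtlCaptureContext and VMLAUNCH, and every name here (fake_dpc_callback, fake_vmlaunch, the g_ globals) is hypothetical.

```cpp
#include <cassert>
#include <csetjmp>

static std::jmp_buf g_saved_context;  // stands in for vcpu->context_frame
static int g_vm_status = 0;           // stands in for vcpu->vm_status
static int g_first_pass_count = 0;
static int g_second_pass_count = 0;

// Stand-in for VMLAUNCH: does not return, resumes at the saved context.
static void fake_vmlaunch() { std::longjmp(g_saved_context, 1); }

void fake_dpc_callback() {
    setjmp(g_saved_context);          // like RtlCaptureContext: we get here twice
    if (g_vm_status == 0) {           // first pass: still "outside the VM"
        g_vm_status = 1;
        ++g_first_pass_count;
        fake_vmlaunch();              // jumps back to just after the capture
    } else if (g_vm_status == 1) {    // second pass: now running "as the guest"
        g_vm_status = 2;
        ++g_second_pass_count;
    }
}
```

After one call to fake_dpc_callback, both branches have run exactly once, which is the same shape as the vm_status checks in init_vm_dpc_callback.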
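If that doesn't make sense yet, don't worry; keep going and it will become clear.
At this point we can enter the VM.
First we use the __vmx_on intrinsic (the VMXON instruction) to activate VMX operation on this core. It takes a pointer to the physical address of the VMXON region, which we keep in vcpu->vmxon_physical, and returns 0 on success: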
if (_huoji_vmx_on(&vcpu->vmxon_physical) != 0) {
    ...... (failure handling)
}
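One step that is not shown above but that VMXON depends on: the first 4 bytes of the VMXON region (and later of the VMCS region) must hold the VMCS revision identifier, which is bits 30:0 of the IA32_VMX_BASIC MSR. A minimal sketch follows; the helper names are hypothetical, and a plain buffer stands in for the real region.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bits 30:0 of IA32_VMX_BASIC are the VMCS revision identifier.
constexpr std::uint32_t vmcs_revision_id(std::uint64_t ia32_vmx_basic) {
    return static_cast<std::uint32_t>(ia32_vmx_basic & 0x7FFFFFFFull);
}

// Hypothetical helper: write the revision identifier into the first dword
// of a zeroed VMXON or VMCS region before VMXON / VMPTRLD is executed.
void stamp_revision_id(void* region, std::uint64_t ia32_vmx_basic) {
    const std::uint32_t id = vmcs_revision_id(ia32_vmx_basic);
    std::memcpy(region, &id, sizeof(id));
}
```

In the driver the value would come from __readmsr(ia32_vmx_basic) and the region would be the 4 KB aligned allocation whose physical address is passed to VMXON.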
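Next we need to initialize the VMCS region (look it up if you don't know what it is).
Before initializing it, call __vmx_vmclear to clear any stale VMCS contents and avoid conflicts, then make it the current VMCS with __vmx_vmptrld: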
if ((_huoji_vmx_vmclear(&vcpu->vmcs_physical) != vmx_success) ||
    (_huoji_vmx_vmptrld(&vcpu->vmcs_physical) != vmx_success)) {
    __debugbreak();
}
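Adjust the VM control fields against the capability MSRs. Which VM-exit events the VM receives is governed by these controls (for example, which instructions should cause a VM exit, or whether APIC virtualization is used), and the capability MSRs say which bits the CPU allows. We adjust them as needed.
For example, we require the guest to run in long mode (look up what real mode, protected mode and long mode mean; it is basic computer organization material):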
_vt_vmx_entry_control_t entry_controls;
entry_controls.control = 0;
entry_controls.bits.ia32e_mode_guest = TRUE;
vt_vmx_adjust_entry_controls(&entry_controls);
These are the controls we need to adjust for now (more will follow later):
_vt_vmx_exit_control_t exit_controls;
exit_controls.control = 0;
exit_controls.bits.host_address_space_size = TRUE;
vt_vmx_adjust_exit_controls(&exit_controls);

_vt_vmx_pinbased_control_msr_t pinbased_controls;
pinbased_controls.control = 0;
vt_vmx_adjust_pinbased_controls(&pinbased_controls);

_vt_vmx_primary_processor_based_control_t primary_controls;
primary_controls.control = 0;
primary_controls.bits.use_msr_bitmaps = TRUE;
primary_controls.bits.active_secondary_controls = TRUE;
//primary_controls.bits.rdtsc_exiting = TRUE; //rdtsc
vt_vmx_adjust_processor_based_controls(&primary_controls);

_vt_vmx_secondary_processor_based_control_t secondary_controls;
secondary_controls.control = 0;
secondary_controls.bits.enable_rdtscp = TRUE;
secondary_controls.bits.enable_xsave_xrstor = TRUE;
secondary_controls.bits.enable_invpcid = TRUE;
vt_vmx_adjust_secondary_controls(&secondary_controls);
The adjustment code (the other adjust functions differ only in which registers they touch; see the Intel manual):
uintptr_t vt_vmx_adjust_cv(unsigned int capability_msr, unsigned int value) {
    union _vt_vmx_true_control_settings_t cap;
    unsigned int actual;
    cap.control = __readmsr(capability_msr);
    actual = value;
    actual |= cap.allowed_0_settings;   // bits the CPU requires to be 1
    actual &= cap.allowed_1_settings;   // bits the CPU allows to be 1
    return actual;
}
void vt_vmx_adjust_entry_controls(union _vt_vmx_entry_control_t* entry_controls) {
    unsigned int capability_msr;
    union _vt_vmx_basic_msr_t basic;
    basic.control = __readmsr(ia32_vmx_basic);
    capability_msr = (basic.bits.true_controls != FALSE) ? ia32_vmx_true_entry_ctrl : ia32_vmx_entry_ctrl;
    entry_controls->control = vt_vmx_adjust_cv(capability_msr, entry_controls->control);
    // Entry controls belong in the VM-entry controls field, not the pin-based one
    _huoji_vmx_vmwrite(vmentry_controls, entry_controls->control);
}
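The same adjustment can be tried in user mode with a made-up capability value. In this sketch I assume the usual layout of the VMX capability MSRs: the low 32 bits are the allowed-0 settings (bits that must be 1) and the high 32 bits are the allowed-1 settings (bits that may be 1). Bit 55 of IA32_VMX_BASIC says whether the "true" capability MSRs exist. The function names are mine.

```cpp
#include <cassert>
#include <cstdint>

// Clamp the desired control bits to what the capability MSR permits.
constexpr std::uint32_t adjust_controls(std::uint64_t capability_msr, std::uint32_t desired) {
    const std::uint32_t allowed0 = static_cast<std::uint32_t>(capability_msr);        // must be 1
    const std::uint32_t allowed1 = static_cast<std::uint32_t>(capability_msr >> 32);  // may be 1
    return (desired | allowed0) & allowed1;
}

// IA32_VMX_BASIC bit 55: the IA32_VMX_TRUE_*_CTLS MSRs are supported.
constexpr bool has_true_controls(std::uint64_t ia32_vmx_basic) {
    return ((ia32_vmx_basic >> 55) & 1) != 0;
}
```

A requested bit that the CPU does not support is silently dropped by this clamp, so it is worth comparing the result against what you asked for before relying on a feature.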
Save them to the VMCS:
_huoji_vmx_vmwrite(pin_based_vm_execution_controls, pinbased_controls.control);
_huoji_vmx_vmwrite(primary_processor_based_vm_execution_controls, primary_controls.control);
_huoji_vmx_vmwrite(secondary_processor_based_vm_execution_controls, secondary_controls.control);
_huoji_vmx_vmwrite(vmexit_controls, exit_controls.control);
_huoji_vmx_vmwrite(vmentry_controls, entry_controls.control);
_huoji_vmx_vmwrite(cr0_guest_host_mask, 0x80000021); // Monitor PE, NE and PG flags
_huoji_vmx_vmwrite(cr4_guest_host_mask, 0x2000);     // Monitor VMXE flags
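Where do the two magic numbers come from? They are just the bit positions of the monitored flags. The sketch below rebuilds them and also shows how a guest/host mask combines with a read shadow: for every bit set in the mask, a guest read of the control register returns the shadow bit instead of the real one. The names bit and guest_visible are mine.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t bit(unsigned n) { return 1ull << n; }

constexpr std::uint64_t kCr0Mask = bit(0) | bit(5) | bit(31);  // PE, NE, PG
constexpr std::uint64_t kCr4Mask = bit(13);                    // VMXE

static_assert(kCr0Mask == 0x80000021ull, "matches the cr0_guest_host_mask value");
static_assert(kCr4Mask == 0x2000ull, "matches the cr4_guest_host_mask value");

// What the guest sees when it reads a masked control register:
// masked bits come from the read shadow, the rest from the real register.
constexpr std::uint64_t guest_visible(std::uint64_t real,
                                      std::uint64_t shadow,
                                      std::uint64_t mask) {
    return (real & ~mask) | (shadow & mask);
}
```

With the real CR4 at 0x2668 and a shadow that has VMXE cleared, the guest reads 0x0668, so a shadow with VMXE cleared hides that bit from the guest.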
Set up the segment registers and the rest of the guest state:
// Guest State Area - CS Segment
_huoji_vmx_vmwrite(guest_cs_selector, state_p.cs.selector);
_huoji_vmx_vmwrite(guest_cs_limit, state_p.cs.limit);
_huoji_vmx_vmwrite(guest_cs_access_rights, vt_attrib(state_p.cs.selector, state_p.cs.attrib));
_huoji_vmx_vmwrite(guest_cs_base, (uintptr_t)state_p.cs.base);
// Guest State Area - DS Segment
_huoji_vmx_vmwrite(guest_ds_selector, state_p.ds.selector);
_huoji_vmx_vmwrite(guest_ds_limit, state_p.ds.limit);
_huoji_vmx_vmwrite(guest_ds_access_rights, vt_attrib(state_p.ds.selector, state_p.ds.attrib));
_huoji_vmx_vmwrite(guest_ds_base, (uintptr_t)state_p.ds.base);
// Guest State Area - ES Segment
_huoji_vmx_vmwrite(guest_es_selector, state_p.es.selector);
_huoji_vmx_vmwrite(guest_es_limit, state_p.es.limit);
_huoji_vmx_vmwrite(guest_es_access_rights, vt_attrib(state_p.es.selector, state_p.es.attrib));
_huoji_vmx_vmwrite(guest_es_base, (uintptr_t)state_p.es.base);
// Guest State Area - FS Segment
_huoji_vmx_vmwrite(guest_fs_selector, state_p.fs.selector);
_huoji_vmx_vmwrite(guest_fs_limit, state_p.fs.limit);
_huoji_vmx_vmwrite(guest_fs_access_rights, vt_attrib(state_p.fs.selector, state_p.fs.attrib));
_huoji_vmx_vmwrite(guest_fs_base, (uintptr_t)state_p.fs.base);
// Guest State Area - GS Segment
_huoji_vmx_vmwrite(guest_gs_selector, state_p.gs.selector);
_huoji_vmx_vmwrite(guest_gs_limit, state_p.gs.limit);
_huoji_vmx_vmwrite(guest_gs_access_rights, vt_attrib(state_p.gs.selector, state_p.gs.attrib));
_huoji_vmx_vmwrite(guest_gs_base, (uintptr_t)state_p.gs.base);
// Guest State Area - SS Segment
_huoji_vmx_vmwrite(guest_ss_selector, state_p.ss.selector);
_huoji_vmx_vmwrite(guest_ss_limit, state_p.ss.limit);
_huoji_vmx_vmwrite(guest_ss_access_rights, vt_attrib(state_p.ss.selector, state_p.ss.attrib));
_huoji_vmx_vmwrite(guest_ss_base, (uintptr_t)state_p.ss.base);
// Guest State Area - Task Register
_huoji_vmx_vmwrite(guest_tr_selector, state_p.tr.selector);
_huoji_vmx_vmwrite(guest_tr_limit, state_p.tr.limit);
_huoji_vmx_vmwrite(guest_tr_access_rights, vt_attrib(state_p.tr.selector, state_p.tr.attrib));
_huoji_vmx_vmwrite(guest_tr_base, (uintptr_t)state_p.tr.base);
// Guest State Area - Local Descriptor Table Register
_huoji_vmx_vmwrite(guest_ldtr_selector, state_p.ldtr.selector);
_huoji_vmx_vmwrite(guest_ldtr_limit, state_p.ldtr.limit);
_huoji_vmx_vmwrite(guest_ldtr_access_rights, vt_attrib(state_p.ldtr.selector, state_p.ldtr.attrib));
_huoji_vmx_vmwrite(guest_ldtr_base, (uintptr_t)state_p.ldtr.base);
// Guest State Area - IDTR and GDTR
_huoji_vmx_vmwrite(guest_gdtr_base, (uintptr_t)state_p.gdtr.base);
_huoji_vmx_vmwrite(guest_idtr_base, (uintptr_t)state_p.idtr.base);
_huoji_vmx_vmwrite(guest_gdtr_limit, state_p.gdtr.limit);
_huoji_vmx_vmwrite(guest_idtr_limit, state_p.idtr.limit);
// Guest State Area - Control Registers
_huoji_vmx_vmwrite(guest_cr0, state_p.cr0);
_huoji_vmx_vmwrite(guest_cr3, state_p.cr3);
_huoji_vmx_vmwrite(guest_cr4, state_p.cr4);
_huoji_vmx_vmwrite(cr0_read_shadow, state_p.cr0);
_huoji_vmx_vmwrite(cr4_read_shadow, state_p.cr4 & ~ia32_cr4_vmxe_bit);
// Guest State Area - Debug Controls
_huoji_vmx_vmwrite(guest_dr7, state_p.dr7);
_huoji_vmx_vmwrite(guest_msr_ia32_debug_ctrl, state_p.debug_ctrl);
// VMCS Link Pointer - Essential for Accelerated VMX Nesting
_huoji_vmx_vmwrite(vmcs_link_pointer, 0xffffffffffffffff);
// Host State Area - Segment Selectors
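The vt_attrib helper is not shown in this article, so here is my guess at the kind of conversion it performs, written as a standalone sketch. It assumes we start from the raw 8-byte GDT descriptor: the VMCS access-rights field takes the descriptor's access byte (bits 40..47) and flag nibble (bits 52..55), drops the limit bits in between, and uses bit 16 to mark a segment as unusable (which I apply when the selector is null). The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kSegmentUnusable = 1u << 16;  // VMCS "unusable" flag

// Hypothetical stand-in for vt_attrib, starting from a raw GDT descriptor.
constexpr std::uint32_t access_rights_from_descriptor(std::uint16_t selector,
                                                      std::uint64_t descriptor) {
    return selector == 0
        ? kSegmentUnusable                                            // null selector
        : static_cast<std::uint32_t>((descriptor >> 40) & 0xF0FFu);   // keep type/S/DPL/P and AVL/L/DB/G
}
```

A 64-bit kernel code descriptor such as 0x00209B0000000000 comes out as 0x209B, and a flat data descriptor 0x00CF93000000FFFF comes out as 0xC093.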
_huoji_vmx_vmwrite(host_cs_selector, state_p.cs.selector & selector_mask);
_huoji_vmx_vmwrite(host_ds_selector, state_p.ds.selector & selector_mask);
_huoji_vmx_vmwrite(host_es_selector, state_p.es.selector & selector_mask);
_huoji_vmx_vmwrite(host_fs_selector, state_p.fs.selector & selector_mask);
_huoji_vmx_vmwrite(host_gs_selector, state_p.gs.selector & selector_mask);
_huoji_vmx_vmwrite(host_ss_selector, state_p.ss.selector & selector_mask);
_huoji_vmx_vmwrite(host_tr_selector, state_p.tr.selector & selector_mask);
// Host State Area - Segment Bases
_huoji_vmx_vmwrite(host_fs_base, (uintptr_t)state_p.fs.base);
_huoji_vmx_vmwrite(host_gs_base, (uintptr_t)state_p.gs.base);
_huoji_vmx_vmwrite(host_tr_base, (uintptr_t)state_p.tr.base);
// Host State Area - Descriptor Tables
_huoji_vmx_vmwrite(host_gdtr_base, (uintptr_t)state_p.gdtr.base);
_huoji_vmx_vmwrite(host_idtr_base, (uintptr_t)state_p.idtr.base);
_huoji_vmx_vmwrite(host_cr0, state_p.cr0);
_huoji_vmx_vmwrite(host_cr3, __readcr3());
_huoji_vmx_vmwrite(host_cr4, state_p.cr4);
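Why are the host selectors ANDed with selector_mask? VM entry requires the RPL bits (0..1) and the TI bit (2) of every host selector to be zero. The value of selector_mask is not shown in this article; the sketch below assumes it is 0xFFF8, which clears exactly those three bits.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value of selector_mask: clear RPL (bits 0..1) and TI (bit 2).
constexpr std::uint16_t kSelectorMask = 0xFFF8;

// Strip the bits that a host selector field must not have set.
constexpr std::uint16_t host_selector(std::uint16_t selector) {
    return static_cast<std::uint16_t>(selector & kSelectorMask);
}
```

For example, a ring-3 data selector 0x2B becomes 0x28, while a kernel selector like 0x10 passes through unchanged.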
This is why we called RtlCaptureContext earlier to save the context:
// Guest State Area - Flags, Stack Pointer, Instruction Pointer
_huoji_vmx_vmwrite(guest_rsp, vcpu->context_frame.Rsp);
_huoji_vmx_vmwrite(guest_rip, vcpu->context_frame.Rip);
Set the host stack (the one we allocated earlier):
_huoji_vmx_vmwrite(host_rsp, (ULONG_PTR)vcpu->stack + KERNEL_STACK_SIZE - sizeof(CONTEXT));
Set the host RIP:
_huoji_vmx_vmwrite(host_rip, (uintptr_t)vmm_entrypoint);
Then launch the virtual machine:
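int status = _huoji_vmx_vmlaunch();
/*
 * On success the code below never runs, because execution continues at GUEST_RIP.
 * Reaching it means the VM entry failed.
 */
if (status != vmx_success) {
    // Read into a full-width variable; reading into an int through a size_t* overruns it
    size_t vmx_error = 0;
    _huoji_vmread(vm_instruction_error, &vmx_error);
    DebugPrint("Failed at VM-Entry, Code=%d\t Reason: %s\n", (int)vmx_error, vt_error_message[vmx_error]);
    return false;
}
__debugbreak();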
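The vt_error_message table is not shown in this article. As an illustration, here is a bounds-checked lookup for the first few VM-instruction error numbers, with the wording taken from the Intel manual as I remember it. The function name is hypothetical and the table is deliberately incomplete.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical lookup: VM-instruction error number -> short description.
// Only errors 1..8 are listed here; anything else reports as unknown.
const char* vm_instruction_error_text(std::size_t code) {
    static const char* const table[] = {
        "(no error number 0)",
        "VMCALL executed in VMX root operation",
        "VMCLEAR with invalid physical address",
        "VMCLEAR with VMXON pointer",
        "VMLAUNCH with non-clear VMCS",
        "VMRESUME with non-launched VMCS",
        "VMRESUME after VMXOFF",
        "VM entry with invalid control field(s)",
        "VM entry with invalid host-state field(s)",
    };
    const std::size_t count = sizeof(table) / sizeof(table[0]);
    return code < count ? table[code] : "Unknown VM-instruction error";
}
```

In my experience errors 7 and 8 are the ones you hit most while bringing up a hypervisor; they point at the control fields and the host-state fields written above.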
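Note that after __vmx_vmlaunch succeeds, the code below it does not run; execution goes to the location saved by RtlCaptureContext earlier (that is exactly what _huoji_vmx_vmwrite(guest_rip, vcpu->context_frame.Rip) wrote).
Our vmm_entrypoint is where execution lands whenever a VM exit occurs. The assembly is as follows: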
vmm_entrypoint proc
    push rcx
    lea rcx, [rsp+8h]
    call RtlCaptureContext
    jmp vmexit_handler
    ; RESTORE_GP
    ; vmresume
vmm_entrypoint endp
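As you can see, I use RtlCaptureContext to save the current context. Note that this clobbers the value of context->Rcx (RCX was overwritten with the CONTEXT pointer before the capture), so we have to restore it in vmexit_handler from the value we pushed: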
DECLSPEC_NORETURN EXTERN_C VOID vmexit_handler(CONTEXT* context) {
    context->Rcx = *(PULONG64)((ULONG_PTR)context - sizeof(context->Rcx));
    ......
}
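The pointer arithmetic in that restore is easy to get wrong, so here is a user-mode simulation of the stack layout. It is only a sketch: fake_context is a two-field stand-in for the real CONTEXT, the stack is a vector of 8-byte slots, and all names are made up. It shows that the pushed RCX ends up in the slot directly below the CONTEXT, which is why the handler reads at context minus 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct fake_context { std::uint64_t Rcx; std::uint64_t Rsp; };  // stand-in for CONTEXT

std::uint64_t simulate_vmexit_rcx_recovery(std::uint64_t guest_rcx) {
    constexpr std::size_t kSlots = 64;                           // fake host stack, 8-byte slots
    constexpr std::size_t kCtxSlots = sizeof(fake_context) / sizeof(std::uint64_t);
    std::vector<std::uint64_t> stack(kSlots, 0);

    std::size_t rsp = kSlots - kCtxSlots;  // host_rsp = stack + size - sizeof(CONTEXT)
    stack[--rsp] = guest_rcx;              // push rcx
    const std::size_t ctx = rsp + 1;       // lea rcx, [rsp+8h]: CONTEXT starts at host_rsp
    stack[ctx] = 0xDEADBEEFull;            // the captured Rcx is clobbered (holds the pointer)

    return stack[ctx - 1];                 // *(context - sizeof(Rcx)): the pushed guest RCX
}
```

This also shows why host_rsp was set to the stack top minus sizeof(CONTEXT): the captured CONTEXT exactly fills the top of the host stack.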
With that, we have successfully entered the virtual machine.
The remaining parts will be covered in upcoming articles.