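Samsung RKP Compendium
The purpose of this blog post is to provide a comprehensive reference of the inner workings of Samsung RKP. It enables anyone to start poking at this obscure code that executes at a high privilege level on their device. In addition, a now-fixed vulnerability that allowed code execution in Samsung RKP is revealed. It is a good example of a simple mistake that compromises platform security, since the exploit consists of a single call, which is all it takes to write to hypervisor memory from the kernel.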
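Table of Contents
Introduction
Kernel Exploitation
Getting Started
Exynos Devices
Snapdragon Devices
Symbols and Log Strings
Hypervisor Crash Course
Our Research Platform
Extracting the Binary
Hypervisor Framework
APP_INIT
APP_RKP
Memlist
Sparsemap
Critical Section
Utility Structures
System Initialization
Application Initialization
Exception Handling
Digging Into RKP
Protecting Kernel Data
Modifying Page Tables
Credentials Protection
Mount Namespace Protection
JOPP and ROPP Commands
First Level
Second Level
Third Level
Overall State After Startup
RKP Start
RKP Deferred Start
RKP Bitmaps
Startup
Page Tables Processing
RKP and KDP Commands
Vulnerability
Conclusion
References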
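In the first part, we will briefly discuss Samsung's kernel mitigations (which probably deserve a blog post of their own). In the second part, we will explain how to obtain the RKP binary for your device.
In the third part, we will start taking apart the hypervisor framework that supports RKP on Exynos devices, before digging into the internals of RKP in the fourth part. We will detail how it is started, how it processes the kernel page tables, how it protects sensitive data structures, and finally, how it enables the kernel mitigations.
In the fifth and final part, we will reveal the vulnerability, the one-line exploit, and take a look at the patch.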
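In the mobile device world, security has traditionally relied on kernel mechanisms. But history has shown us that the kernel is far from unbreakable. On most Android devices, finding a kernel vulnerability allows an attacker to modify sensitive kernel data structures, elevate privileges, and execute malicious code.
Ensuring kernel integrity at boot time (using the Verified Boot mechanism) is not sufficient either. Kernel integrity must also be verified at run time. This is what a security hypervisor aims to do. RKP, which stands for Real-time Kernel Protection, is the name of Samsung's hypervisor implementation, and it is part of Samsung KNOX.
A lot of great research has already been done on Samsung RKP, in particular Gal Beniamini's "Lifting the (Hyper) Visor: Bypassing Samsung's Real-Time Kernel Protection" and Aris Thallas's "On emulating hypervisors: a Samsung RKP case study", both of which we highly recommend reading before this blog post.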
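A typical local privilege escalation (LPE) flow on Android involves:
bypassing KASLR by leaking a kernel pointer;
getting a one-off arbitrary kernel memory read/write;
using it to overwrite a kernel function pointer;
calling a function to set the address_limit to -1;
bypassing SELinux by writing selinux_(enable|enforcing);
escalating privileges by writing the uid, gid, sid, capabilities, etc.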
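Samsung has implemented mitigations to make this task as hard as possible for an attacker: JOPP, ROPP, and KDP are three of them. Not all Samsung devices have the same mitigations in place, though.
Here is what we observed after downloading various firmware updates:
Device | Region | JOPP | ROPP | KDP |
---|
Low-end | International | No | No | Yes |
Low-end | United States | No | No | Yes |
High-end | International | Yes | No | Yes |
High-end | United States | Yes | Yes | Yes |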
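JOPP¶
Jump-Oriented Programming Prevention (JOPP) aims to prevent JOP. It is a homemade CFI solution. It first uses a modified compiler toolchain to insert a NOP instruction before each function start. It then uses a Python script (scripts/rkp_cfp/instrument.py) to process the compiled kernel binary: NOPs are replaced with a magic value (0xbe7bad) and indirect branches are replaced with direct branches to a helper function.
The helper function, jopp_springboard_blr_rX (in init/rkp_cfp.S), checks whether the value located right before the target matches the magic value. It takes the jump if it does and crashes if it does not: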
▸ init/rkp_cfp.S
.macro springboard_blr, reg
jopp_springboard_blr_\reg:
push RRX, xzr
ldr RRX_32, [\reg, #-4]
subs RRX_32, RRX_32, #0xbe7, lsl #12
cmp RRX_32, #0xbad
b.eq 1f
...
inst 0xdeadc0de //crash for sure
...
1:
pop RRX, xzr
br \reg
.endm
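To make the springboard logic easier to follow, here is a minimal Python sketch of the same check. It is only a model of the assembly above, not Samsung's code: the jopp_check name and the fake memory image are assumptions made for illustration.

```python
import struct

JOPP_MAGIC = 0xBE7BAD  # magic value placed before each function start


def jopp_check(memory: bytes, target: int) -> bool:
    """Return True if the 32-bit word right before `target` is the magic.

    Hypothetical helper mirroring the springboard: it loads the word at
    target - 4, subtracts 0xbe7 << 12 and compares the result to 0xbad.
    """
    if target < 4 or target > len(memory):
        return False
    (word,) = struct.unpack_from("<I", memory, target - 4)
    return (word - (0xBE7 << 12)) & 0xFFFFFFFF == 0xBAD


# Fake image: the magic value followed by 8 bytes of "function body".
image = struct.pack("<I", JOPP_MAGIC) + b"\x00" * 8
valid_entry = 4    # address right after the magic value
mid_function = 8   # address in the middle of the function
```

In this model, an indirect branch to valid_entry is accepted, whereas a branch to mid_function (a typical JOP gadget location) is rejected.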
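ROPP¶
Return-Oriented Programming Prevention (ROPP) aims to prevent ROP. It is a homemade "stack canary". It uses the same modified compiler toolchain to emit a NOP instruction before stp x29, x30 instructions and after ldp x29, x30 instructions, and to prevent the allocation of registers X16 and X17. It then uses the same Python script to replace the prologues and epilogues of assembled C functions like so: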
nop
stp x29, x30, [sp,#-<frame>]!
(insns)
ldp x29, x30, ...
nop
is replaced with
eor RRX, x30, RRK
stp x29, RRX, [sp,#-<frame>]!
(insns)
ldp x29, RRX, ...
eor x30, RRX, RRK
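where RRX is an alias for X16 and RRK for X17.
RRK is called the "thread key" and is unique to each kernel task. Instead of pushing the return address directly onto the stack, the function first XORs it with this key, which prevents an attacker from changing the return address without knowing the thread key.
The thread key itself is stored in the rrk field of the thread_info structure, but XORed with the RRMK.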
▸ arch/arm64/include/asm/thread_info.h
struct thread_info {
// ...
unsigned long rrk;
};
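RRMK is called the "master key". On production devices, it is stored in the Debug Breakpoint Control Register 5 system register (DBGBCR5_EL1). It is set by the hypervisor during kernel initialization, as we will see later.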
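The XOR scheme described above can be sketched in a few lines of Python. This is only an illustrative model: the function names and all key and address values are hypothetical.

```python
MASK64 = (1 << 64) - 1


def ropp_encode(return_address: int, thread_key: int) -> int:
    """Value pushed on the stack instead of x30 (eor RRX, x30, RRK)."""
    return (return_address ^ thread_key) & MASK64


def ropp_decode(saved_value: int, thread_key: int) -> int:
    """Value restored into x30 in the epilogue (eor x30, RRX, RRK)."""
    return (saved_value ^ thread_key) & MASK64


def stored_rrk(thread_key: int, master_key: int) -> int:
    """Value kept in thread_info.rrk: the thread key XORed with the RRMK."""
    return (thread_key ^ master_key) & MASK64


# Hypothetical values, chosen only for illustration.
thread_key = 0x1122334455667788
master_key = 0x0F0F0F0F0F0F0F0F
return_address = 0xFFFFFF8008123456
gadget = 0xFFFFFF8008ABCDEF

saved = ropp_encode(return_address, thread_key)
```

A legitimate return decodes back to return_address. An attacker who overwrites the saved value with a raw gadget address, without knowing the thread key, ends up returning to gadget XOR thread_key instead.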
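KDP¶
Kernel Data Protection (KDP) is another hypervisor-enabled mitigation. It is a homemade Data Flow Integrity (DFI) solution. It relies on the hypervisor to make sensitive kernel data structures (such as the page tables, struct cred, struct task_security_struct, struct vfsmount, the SELinux status, etc.) read-only.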
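Hypervisor Crash Course¶
To understand Samsung RKP, you need some basic knowledge of the virtualization extensions on ARMv8 platforms. We recommend reading the "HYP 101" section of Lifting the (Hyper) Visor or the "ARM Architecture & Virtualization Extensions" section of On emulating hypervisors.
To paraphrase these chapters, a hypervisor executes at a higher privilege level than the kernel, which gives it complete control over it. Here is what the architecture looks like on ARMv8 platforms:
The hypervisor can receive calls from the kernel through the HyperVisor Call (HVC) instruction. Moreover, by using the Hypervisor Configuration Register (HCR), the hypervisor can trap critical operations usually handled by the kernel (accesses to virtual memory control registers, etc.) and can also handle general exceptions.
Finally, the hypervisor takes advantage of a second layer of address translation, called "stage 2 translation". In the standard "stage 1 translation", a Virtual Address (VA) is translated into an Intermediate Physical Address (IPA). This IPA is then translated into the final Physical Address (PA) by the second stage.
Here is what the address translation looks like with 2-stage address translation enabled:
The hypervisor still only has a single stage of address translation for its own memory accesses.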
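As a rough illustration of the two stages, the following Python sketch models each stage as a flat page-granular dictionary. Real page tables are multi-level and carry permissions; the addresses used here are hypothetical.

```python
PAGE_MASK = 0xFFF


def translate(va, stage1, stage2):
    """Translate a VA to a PA through two stages, or return None on a fault."""
    page, offset = va & ~PAGE_MASK, va & PAGE_MASK
    ipa_page = stage1.get(page)       # stage 1: VA -> IPA (kernel-controlled)
    if ipa_page is None:
        return None
    pa_page = stage2.get(ipa_page)    # stage 2: IPA -> PA (hypervisor-controlled)
    if pa_page is None:
        return None
    return pa_page | offset


# Hypothetical mappings: one kernel page, identity-mapped at stage 2.
stage1 = {0xFFFFFF8008000000: 0x80000000}
stage2 = {0x80000000: 0x80000000}
```

The point of the model is that removing an entry from stage2 makes the page unreachable from the kernel, whatever the kernel writes into its own stage 1 tables.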
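Our Research Platform¶
To make it easier to get started with this research, we have been using a bootloader-unlocked Samsung A51 (SM-A515F) instead of a full exploit chain. We downloaded the kernel source code for our device from Samsung's Open Source website, modified it, and recompiled it (which did not work out of the box).
For this research, we implemented new syscalls:
kernel memory allocation/freeing;
arbitrary read/write of kernel memory;
hypervisor calls (using the uh_call function).
These syscalls make it very convenient to interact with RKP, as you will see in the exploitation section: we just need to write a piece of C code (or Python) that executes in userland and does whatever we want.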
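Extracting the Binary¶
RKP is implemented for both Exynos and Snapdragon-equipped devices, and the two implementations share a lot of code. However, most, if not all, of the existing research has been done on the Exynos variant, as it is the most straightforward to dig into: RKP is available as a standalone binary. On Snapdragon devices, it is embedded inside the Qualcomm Hypervisor Execution Environment (QHEE) image, which is very large and complicated.
Exynos Devices¶
On Exynos devices, RKP used to be embedded directly into the kernel binary, so it could be found as the vmm.elf file in the kernel source archives. Around late 2017/early 2018, the VMM was rewritten into a new framework called uH, which most likely stands for "micro-hypervisor". Consequently, the binary was renamed to uh.elf and can still be found in the kernel source archives of a few devices.
Following Gal Beniamini's first suggested design improvement, on most devices RKP has been moved out of the kernel binary and into a partition of its own called uh. That makes it even easier to extract, for example by grabbing it from the BL_xxx.tar archive contained in a firmware update (it is usually LZ4-compressed and starts with a 0x1000-byte header that needs to be stripped to get to the real ELF file).
The architecture changed slightly on the S20 and later devices, as Samsung introduced another framework to support RKP (called H-Arx), most likely to further unify the code base with the Snapdragon devices. It also features more uH "apps". However, we won't be covering it in this blog post.
Snapdragon Devices¶
On Snapdragon devices, RKP can be found in the hyp partition and can also be extracted from the BL_xxx.tar archive of a firmware update. It is one of the segments that make up the QHEE image.
The main difference with Exynos devices is that it is QHEE that sets up the page tables and the exception vector. As a result, it is QHEE that notifies uH when exceptions happen (HVC or trapped system register), and uH has to make a call to QHEE when it wants to modify the page tables. The rest of the code is almost identical.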
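Symbols and Log Strings¶
Back in 2017, the RKP binary was shipped with symbols and log strings. This is no longer the case. Nowadays, the binaries are stripped and the log strings are replaced with placeholders (as Qualcomm does). Nevertheless, we tried getting our hands on as many binaries as possible, hoping that Samsung did not do this for all of its devices, as is sometimes the case with other OEMs.
By mass-downloading firmware updates for various Exynos devices, we gathered about 300 unique hypervisor binaries. None of the uh.elf files had symbols, so we had to port them manually from the old vmm.elf files. Some of the uh.elf files had the full log strings, the latest one being from Apr 9 2019.
With the full version of the log strings and their hashed version, we could figure out that the hash value is simply a truncation of the SHA256 output. Here is a Python one-liner to compute the hash, should you need it: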
hashlib.sha256(log_string).hexdigest()[:8]
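For convenience, the one-liner can be wrapped into a small runnable helper. The UTF-8 encoding and the "RKP_" prefix are assumptions: the prefix is inferred from placeholders such as RKP_1cae4f3b that appear in the decompiled listings of this post, and the function names are ours.

```python
import hashlib


def log_string_hash(log_string: str) -> str:
    """First 8 hex digits of the SHA256 of the log string."""
    return hashlib.sha256(log_string.encode("utf-8")).hexdigest()[:8]


def placeholder(log_string: str) -> str:
    """Assumed placeholder format: 'RKP_' followed by the truncated hash."""
    return "RKP_" + log_string_hash(log_string)


if __name__ == "__main__":
    # Hash a log string recovered from an older binary to find its placeholder.
    print(placeholder("[+] vmm initialized"))
```

Computing the placeholder for every string found in an older, unstripped binary gives a lookup table that can be applied to newer binaries.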
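The uH framework acts as a micro-OS, of which RKP is an application. This is really more of a way to organize things, as "apps" are simply a bunch of command handlers and do not have any kind of isolation.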
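Utility Structures¶
Before digging into the code, we will briefly describe the utility structures that are used extensively by uH and the RKP app. We won't detail their implementation, but it is important to understand what they do.
Memlist¶
The memlist_t structure is a list of address ranges, a specialized version of a C++ vector of sorts (it has a capacity and a size).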
typedef struct memlist_entry {
uint64_t addr;
uint64_t size;
uint64_t unkn_10;
uint64_t extra;
} memlist_entry_t;
typedef struct memlist {
memlist_entry_t* base;
uint32_t capacity;
uint32_t count;
uint32_t merged;
crit_sec_t cs;
} memlist_t;
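There are functions to add and remove address ranges from a memlist, to check if an address is contained in a memlist, if an address range overlaps with a memlist, etc.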
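To give an idea of the semantics, here is a minimal Python model of a memlist. It only captures the behavior described above; the class and method names are hypothetical and do not correspond to the actual uH functions.

```python
class Memlist:
    """Toy model of memlist_t: a list of (addr, size) ranges."""

    def __init__(self):
        self.entries = []

    def add(self, addr: int, size: int) -> None:
        self.entries.append((addr, size))

    def contains_addr(self, addr: int) -> bool:
        # True if the address falls inside any range of the list.
        return any(a <= addr < a + s for a, s in self.entries)

    def overlaps_range(self, addr: int, size: int) -> bool:
        # True if [addr, addr + size) intersects any range of the list.
        return any(addr < a + s and a < addr + size for a, s in self.entries)


# Example: the uH memory region used throughout this post.
protected = Memlist()
protected.add(0x87000000, 0x200000)
```

A check such as protected.overlaps_range(0x871FF000, 0x2000) returns True even though the range only partially covers the region, which is the property a protection check needs.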
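Sparsemap¶
The sparsemap_t structure is a map that associates values with addresses. It is created from a memlist and maps all the addresses in this memlist to a value. The size of this value is determined by the bit_per_page field.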
typedef struct sparsemap_entry {
uint64_t addr;
uint64_t size;
uint64_t bitmap_size;
uint8_t* bitmap;
} sparsemap_entry_t;
typedef struct sparsemap {
char name[8];
uint64_t start_addr;
uint64_t end_addr;
uint64_t count;
uint64_t bit_per_page;
uint64_t mask;
crit_sec_t cs;
memlist_t* list;
sparsemap_entry_t* entries;
uint32_t private;
uint32_t unkn_54;
} sparsemap_t;
There are functions to get and set the value for each entry of the map, etc.
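The following Python sketch models a single sparsemap entry, storing bit_per_page bits for each 4 KB page of a range. The exact bit layout used by uH is not described here, so the packing below is an assumption, as are the class and method names.

```python
PAGE_SIZE = 0x1000


class SparsemapEntry:
    """Toy model of sparsemap_entry_t: a bitmap covering [addr, addr + size)."""

    def __init__(self, addr: int, size: int, bit_per_page: int):
        self.addr = addr
        self.size = size
        self.bit_per_page = bit_per_page
        pages = size // PAGE_SIZE
        self.bitmap = bytearray((pages * bit_per_page + 7) // 8)

    def _bit_offset(self, addr: int) -> int:
        return ((addr - self.addr) // PAGE_SIZE) * self.bit_per_page

    def get(self, addr: int) -> int:
        start = self._bit_offset(addr)
        value = 0
        for i in range(self.bit_per_page):
            bit = start + i
            value |= ((self.bitmap[bit // 8] >> (bit % 8)) & 1) << i
        return value

    def set(self, addr: int, value: int) -> None:
        start = self._bit_offset(addr)
        for i in range(self.bit_per_page):
            bit = start + i
            if (value >> i) & 1:
                self.bitmap[bit // 8] |= 1 << (bit % 8)
            else:
                self.bitmap[bit // 8] &= ~(1 << (bit % 8)) & 0xFF


one_bit = SparsemapEntry(0x80000000, 0x4000, 1)      # 1 bit per page
wide = SparsemapEntry(0x80000000, 0x4000, 0x20)      # 32 bits per page
one_bit.set(0x80002000, 1)
wide.set(0x80001000, 0xDEADBEEF)
```

With bit_per_page set to 1 the entry behaves like a plain bitmap, while a value of 0x20 gives one 32-bit slot per page.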
Critical Section¶
The crit_sec_t structure is used to implement critical sections.
typedef struct crit_sec {
uint32_t cpu;
uint32_t lock;
uint64_t lr;
} crit_sec_t;
Of course, there are functions to enter and exit the critical sections.
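System Initialization¶
uH/RKP is loaded into memory by the Samsung Bootloader (S-Boot). S-Boot jumps to the EL2 entry point by asking the secure monitor (running at EL3) to start executing the hypervisor code at the address it specifies.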
uint64_t cmd_load_hypervisor() {
// ...
part = FindPartitionByName("UH");
if (part) {
dprintf("%s: loading uH image from %d..\n", "f_load_hypervisor", part->block_offset);
ReadPartition(&hdr, part->file_offset, part->block_offset, 0x4c);
dprintf("[uH] uh page size = 0x%x\n", (((hdr.size - 1) >> 12) + 1) << 12);
total_size = hdr.size + 0x1210;
dprintf("[uH] uh total load size = 0x%x\n", total_size);
if (total_size > 0x200000 || hdr.size > 0x1fedf0) {
dprintf("Could not do normal boot.(invalid uH length)\n");
// ...
}
ret = memcmp_s(&hdr, "GREENTEA", 8);
if (ret) {
ret = -1;
dprintf("Could not do uh load. (invalid magic)\n");
// ...
} else {
ReadPartition(0x86fff000, part->file_offset, part->block_offset, total_size);
ret = pit_check_signature(part->partition_name, 0x86fff000, total_size);
if (ret) {
dprintf("Could not do uh load. (invalid signing) %x\n", ret);
// ...
}
load_hypervisor(0xc2000400, 0x87001000, 0x2000, 1, 0x87000000, 0x100000);
dprintf("[uH] load hypervisor\n");
}
} else {
ret = -1;
dprintf("Could not load uH. (invalid ppi)\n");
// ...
}
return ret;
}
void load_hypervisor(...) {
dsb();
asm("smc #0");
isb();
}
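The size and magic checks performed by cmd_load_hypervisor can be reproduced with a short Python sketch. The constants come from the decompiled code above; the function names are ours, and the meaning of the extra 0x1210 bytes is not known.

```python
EXTRA_SIZE = 0x1210        # constant added to hdr.size in the decompiled code
MAX_TOTAL_SIZE = 0x200000
MAX_HDR_SIZE = 0x1FEDF0


def page_rounded_size(size: int) -> int:
    """Round a size up to a multiple of 0x1000, as done for the log message."""
    return (((size - 1) >> 12) + 1) << 12


def total_load_size(hdr_size: int) -> int:
    return hdr_size + EXTRA_SIZE


def is_valid_length(hdr_size: int) -> bool:
    """Mirror of the 'invalid uH length' check."""
    return not (total_load_size(hdr_size) > MAX_TOTAL_SIZE or hdr_size > MAX_HDR_SIZE)


def has_valid_magic(header: bytes) -> bool:
    """Mirror of the 8-byte comparison against the GREENTEA magic."""
    return header[:8] == b"GREENTEA"
```

Note that MAX_HDR_SIZE + EXTRA_SIZE is exactly 0x200000, so for non-negative sizes the two conditions of the length check reject the same images.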
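Note that on recent Samsung devices, the monitor code, which is based on ARM Trusted Firmware (ATF), is no longer in plaintext in the S-Boot binary. An encrypted blob can be found in its place. A vulnerability in Samsung's Trusted OS implementation (TEEGRIS) would need to be found so that the plaintext monitor code can be dumped.
The address translation (AT) process has two stages for EL1 accesses, but only one stage for EL2 accesses. In the hypervisor code, stage 1 (abbreviated s1) refers to the first stage of the EL2 AT process that governs hypervisor accesses. Stage 2 (abbreviated s2) refers to the second stage of the EL1 AT process that governs kernel accesses.
Execution starts in the default function. This function checks that it is running at EL2 before calling main. Once main returns, it makes an SMC, presumably to hand control back to S-Boot.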
void default(...) {
// ...
if (get_current_el() == (0b10 /* EL2 */ << 2)) {
// Save registers x0 to x30, sp_el1, elr_el2, spsr_el2.
// ...
// Reset the .bss section.
memset(&rkp_bss_start, 0, 0x1000);
main(saved_regs.x0, saved_regs.x1, &saved_regs);
}
// Return to S-Boot after initialization.
asm("smc #0");
}
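After disabling alignment checks and making sure the binary is loaded at the expected address (0x87000000 for this binary), main sets TTBR0_EL2 to its initial page tables and calls s1_enable to enable address translation at EL2. The initial page tables for EL2 are embedded directly in the hypervisor binary and contain a 1:1 mapping of the uH region.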
int32_t main(int64_t x0, int64_t x1, saved_regs_t* regs) {
// ...
// SCTLR_EL2, System Control Register (EL2).
//
// - A, bit [1] = 0: Alignment fault checking disabled.
// - SA, bit [3] = 0: SP Alignment check disabled.
set_sctlr_el2(get_sctlr_el2() & 0xfffffff5);
// Prevent the hypervisor from being initialized twice.
if (!initialized) {
initialized = 1;
// Check if the loading address is as expected.
if (&hyp_base != 0x87000000) {
uh_log('L', "slsi_main.c", 326, "[-] static s1 mmu mismatch");
return -1;
}
// Set the EL2 page tables start address.
set_ttbr0_el2(&static_s1_page_tables_start__);
// Enable the EL2 address translation.
s1_enable();
// Initialize the hypervisor.
uh_init(0x87000000, 0x200000);
// Initialize the virtual memory manager (VMM).
if (vmm_init()) {
return -1;
}
uh_log('L', "slsi_main.c", 338, "[+] vmm initialized");
// Set the second stage EL1 page tables start address.
set_vttbr_el2(&static_s2_page_tables_start__);
uh_log('L', "slsi_main.c", 348, "[+] static s2 mmu initialized");
// Enable the second stage of EL1 address translation.
s2_enable();
uh_log('L', "slsi_main.c", 351, "[+] static s2 mmu enabled");
}
uh_log('L', "slsi_main.c", 355, "[*] initialization completed");
return 0;
}
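s1_enable mostly sets cache-related fields of MAIR_EL2, TCR_EL2, and SCTLR_EL2, and most importantly, enables the MMU for EL2. main then calls the uh_init function and passes it the uH memory range. It also appears that Gal Beniamini's second suggested design improvement, setting the WXN bit to 1, has been implemented by the Samsung KNOX team.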
void s1_enable() {
// ...
cs_init(&s1_lock);
// MAIR_EL2, Memory Attribute Indirection Register (EL2).
//
// - Attr0, bits[7:0] = 0xff: Normal memory, Outer & Inner Write-Back Non-transient, Outer & Inner Read-Allocate
// Write-Allocate).
// - Attr1, bits[15:8] = 0x00: Device-nGnRnE memory.
// - Attr2, bits[23:16] = 0x44: Normal memory, Outer & Inner Write-Back Transient, Outer & Inner No Read-Allocate No
// Write-Allocate).
set_mair_el2(get_mair_el2() & 0xffffffffff000000 | 0x4400ff);
// TCR_EL2, Translation Control Register (EL2).
//
// - T0SZ, bits [5:0] = 24: TTBR0_EL2 region size is 2^40.
// - IRGN0, bits [9:8] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
// - ORGN0, bits [11:10] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
// - SH0, bits [13:12] = 0b11: Inner Shareable.
// - PS, bits [18:16] = 0b010: PA size is 40 bits, 1TB.
set_tcr_el2(get_tcr_el2() & 0xfff8c0c0 | 0x23f18);
flush_entire_cache();
sctlr_el2 = get_sctlr_el2();
// SCTLR_EL2, System Control Register (EL2).
//
// - C, bit [2] = 1: data is cacheable for EL2.
// - I, bit [12] = 1: instruction access is cacheable for EL2.
// - WXN, bit [19] = 1: writeable implies non-executable for EL2.
set_sctlr_el2(sctlr_el2 & 0xfff7effb | 0x81004);
invalidate_entire_s1_el2_tlb();
// - M, bit [0] = 1: EL2 stage 1 address translation enabled.
set_sctlr_el2(sctlr_el2 & 0xfff7effa | 0x81005);
}
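The field values listed in the comments above can be double-checked by decoding the constants that s1_enable ORs into the registers. The sketch below only does bit arithmetic on those constants; the field helper is ours.

```python
def field(value: int, hi: int, lo: int) -> int:
    """Extract bits [hi:lo] of a register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


TCR_EL2_SET = 0x23F18     # value ORed into TCR_EL2
SCTLR_EL2_SET = 0x81005   # value ORed into SCTLR_EL2 (final write)

tcr = {
    "T0SZ": field(TCR_EL2_SET, 5, 0),
    "IRGN0": field(TCR_EL2_SET, 9, 8),
    "ORGN0": field(TCR_EL2_SET, 11, 10),
    "SH0": field(TCR_EL2_SET, 13, 12),
    "PS": field(TCR_EL2_SET, 18, 16),
}

sctlr = {
    "M": field(SCTLR_EL2_SET, 0, 0),
    "C": field(SCTLR_EL2_SET, 2, 2),
    "I": field(SCTLR_EL2_SET, 12, 12),
    "WXN": field(SCTLR_EL2_SET, 19, 19),
}

# The TTBR0_EL2 region spans 2^(64 - T0SZ) bytes.
region_bits = 64 - tcr["T0SZ"]
```

The decoded values match the comments: T0SZ is 24 (a 40-bit region), both cacheability fields and SH0 are 0b11, PS is 0b010, and the M, C, I, and WXN bits are all set.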
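After saving its arguments into a global control structure that we named uh_state, uh_init calls static_heap_initialize. This function also saves its arguments into global variables and initializes the doubly linked list of heap chunks with a single free chunk spanning the hypervisor memory range.
uh_init then calls static_heap_remove_range to remove three important ranges (the log region, the uH region, and the bigdata region) from the memory that can be returned by the static heap allocator, effectively splitting the original chunk into multiple ones: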
int64_t uh_init(int64_t uh_base, int64_t uh_size) {
// ...
// Reset the global state of the hypervisor.
memset(&uh_state.base, 0, sizeof(uh_state));
// Save the hypervisor base address and size.
uh_state.base = uh_base;
uh_state.size = uh_size;
// Initialize the static heap with the whole hypervisor memory.
static_heap_initialize(uh_base, uh_size);
// But remove the log, uH and bigdata regions from it.
if (!static_heap_remove_range(0x87100000, 0x40000) || !static_heap_remove_range(&hyp_base, 0x87046000 - &hyp_base) ||
!static_heap_remove_range(0x870ff000, 0x1000)) {
uh_panic();
}
// Initialize the log region.
memory_init();
uh_log('L', "main.c", 131, "================================= LOG FORMAT =================================");
uh_log('L', "main.c", 132, "[LOG:L, WARN: W, ERR: E, DIE:D][Core Num: Log Line Num][File Name:Code Line]");
uh_log('L', "main.c", 133, "==============================================================================");
uh_log('L', "main.c", 134, "[+] uH base: 0x%p, size: 0x%lx", uh_state.base, uh_state.size);
uh_log('L', "main.c", 135, "[+] log base: 0x%p, size: 0x%x", 0x87100000, 0x40000);
uh_log('L', "main.c", 137, "[+] code base: 0x%p, size: 0x%p", &hyp_base, 0x46000);
uh_log('L', "main.c", 139, "[+] stack base: 0x%p, size: 0x%p", stacks, 0x10000);
uh_log('L', "main.c", 143, "[+] bigdata base: 0x%p, size: 0x%p", 0x870ffc40, 0x3c0);
uh_log('L', "main.c", 152, "[+] date: %s, time: %s", "Feb 27 2020", "17:28:58");
uh_log('L', "main.c", 153, "[+] version: %s", "UH64_3b7c7d4f exynos9610");
// Register the command handlers for the INIT app.
uh_register_commands(0, init_cmds, 0, 5, 1);
// Register the command handlers for the RKP app.
j_rkp_register_commands();
uh_log('L', "main.c", 370, "%d app started", 1);
// Initialize the INIT app.
system_init();
// Initialize the other apps (including the RKP app).
apps_init();
// Initialize the bigdata region.
uh_init_bigdata();
// Initialize the context buffer.
uh_init_context();
// Create the memlist of memory regions used by the dynamic heap allocator.
memlist_init(&uh_state.dynamic_regions);
// Create and fill the memlist of protected ranges (critical memory regions).
pa_restrict_init();
// Mark the hypervisor as initialized.
uh_state.inited = 1;
uh_log('L', "main.c", 427, "[+] uH initialized");
return 0;
}
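uh_init then calls memory_init, which zeroes out the log region and maps it into the EL2 page tables. This region will be used by the printf-like string printing functions that are called inside the uh_log function.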
int64_t memory_init() {
// Reset the log region.
memory_buffer = 0x87100000;
memset(0x87100000, 0, 0x40000);
cs_init(&memory_cs);
clean_invalidate_data_cache_region(0x87100000, 0x40000);
memory_buffer_index = 0;
memory_active = 1;
// Map it into the hypervisor page tables as writable.
return s1_map(0x87100000, 0x40000, UNKN3 | WRITE | READ);
}
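uh_init then logs various pieces of information using uh_log (these messages can be retrieved from /proc/uh_log on the device). uh_init then calls uh_register_commands and rkp_register_commands (which also ends up calling uh_register_commands, but with a different set of arguments).
uh_register_commands takes as arguments the application ID, an array of command handlers, an optional command "checker" function, the number of commands in the array, and a debug flag. These values are stored in the cmd_evtable, cmd_checkers, cmd_counts, and cmd_flags fields of the uh_state structure and will be used to handle hypervisor calls coming from the kernel.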
int64_t uh_register_commands(uint32_t app_id,
int64_t cmd_array,
int64_t cmd_checker,
uint32_t cmd_count,
uint32_t flag) {
// ...
// Ensure the hypervisor hasn't already been initialized.
if (uh_state.inited) {
uh_log('D', "event.c", 11, "uh_register_event is not permitted after uh_init : %d", app_id);
}
// Perform sanity-checking on the application ID.
if (app_id >= 8) {
uh_log('D', "event.c", 14, "wrong app_id %d", app_id);
}
// Save the arguments into the `uh_state` global variable.
uh_state.cmd_evtable[app_id] = cmd_array;
uh_state.cmd_checkers[app_id] = cmd_checker;
uh_state.cmd_counts[app_id] = cmd_count;
uh_state.cmd_flags[app_id] = flag;
uh_log('L', "event.c", 21, "app_id:%d, %d events and flag(%d) has registered", app_id, cmd_count, flag);
// The "command checker" is optional.
if (cmd_checker) {
uh_log('L', "event.c", 24, "app_id:%d, cmd checker enforced", app_id);
}
return 0;
}
According to the kernel sources, only 3 applications are defined, even though uH technically supports up to 8.
▸ include/linux/uh.h
#define APP_INIT 0
#define APP_SAMPLE 1
#define APP_RKP 2
#define UH_PREFIX UL(0xc300c000)
#define UH_APPID(APP_ID) ((UL(APP_ID) & UL(0xFF)) | UH_PREFIX)
enum __UH_APP_ID {
UH_APP_INIT = UH_APPID(APP_INIT),
UH_APP_SAMPLE = UH_APPID(APP_SAMPLE),
UH_APP_RKP = UH_APPID(APP_RKP),
};
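As a quick sanity check, the UH_APPID macro can be evaluated in Python to obtain the values passed as the first argument of a hypervisor call. The function below is a direct transcription of the macro.

```python
UH_PREFIX = 0xC300C000

APP_INIT = 0
APP_SAMPLE = 1
APP_RKP = 2


def uh_appid(app_id: int) -> int:
    """Python transcription of the UH_APPID macro."""
    return (app_id & 0xFF) | UH_PREFIX


UH_APP_INIT = uh_appid(APP_INIT)
UH_APP_SAMPLE = uh_appid(APP_SAMPLE)
UH_APP_RKP = uh_appid(APP_RKP)
```

UH_APP_INIT evaluates to 0xc300c000, which is the constant that S-Boot passes to uh_call for the APP_INIT commands, and UH_APP_RKP evaluates to 0xc300c002.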
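uh_init then calls system_init and apps_init. These functions call command handler #0 of the corresponding application(s): APP_INIT for system_init, and all the other registered applications for apps_init. In our case, it ends up calling init_cmd_init and rkp_cmd_init, respectively.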
uint64_t system_init() {
// ...
memset(&saved_regs, 0, sizeof(saved_regs));
// Call the command handler #0 of APP_INIT.
res = uh_handle_command(0, 0, &saved_regs);
if (res) {
uh_log('D', "main.c", 380, "system init failed %d", res);
}
return res;
}
uint64_t apps_init() {
// ...
memset(&saved_regs, 0, sizeof(saved_regs));
// Iterate on all applications but APP_INIT.
for (i = 1; i != 8; ++i) {
// Ensure the application is registered.
if (uh_state.cmd_evtable[i]) {
uh_log('W', "main.c", 393, "[+] dst %d initialized", i);
// Call the command handler #0 of the application.
res = uh_handle_command(i, 0, &saved_regs);
if (res) {
uh_log('D', "main.c", 396, "app init failed %d", res);
}
}
}
return res;
}
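uh_handle_command prints the app ID, the command ID, and its arguments if the debug flag was set, calls the command checker function if there is one, and then calls the appropriate command handler.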
int64_t uh_handle_command(uint64_t app_id, uint64_t cmd_id, saved_regs_t* regs) {
// ...
// If debug is enabled, log the command to be handled.
if ((uh_state.cmd_flags[app_id] & 1) != 0) {
uh_log('L', "main.c", 441, "event received %lx %lx %lx %lx %lx %lx", app_id, cmd_id, regs->x2, regs->x3, regs->x4,
regs->x5);
}
// If a "command checker" is registered for the application, call it.
cmd_checker = uh_state.cmd_checkers[app_id];
if (cmd_id && cmd_checker && cmd_checker(cmd_id)) {
uh_log('E', "main.c", 448, "cmd check failed %d %d", app_id, cmd_id);
return -1;
}
// Perform sanity-checking on the application ID.
if (app_id >= 8) {
uh_log('D', "main.c", 453, "wrong dst %d", app_id);
}
// Ensure the destination application is registered.
if (!uh_state.cmd_evtable[app_id]) {
uh_log('D', "main.c", 456, "dst %d evtable is NULL\n", app_id);
}
// Perform sanity-checking on the command ID.
if (cmd_id >= uh_state.cmd_counts[app_id]) {
uh_log('D', "main.c", 459, "wrong type %lx %lx", app_id, cmd_id);
}
// Get the actual command handler.
cmd_handler = uh_state.cmd_evtable[app_id][cmd_id];
if (!cmd_handler) {
uh_log('D', "main.c", 464, "no handler %lx %lx", app_id, cmd_id);
return -1;
}
// And finally, call it.
return cmd_handler(regs);
}
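The registration and dispatch mechanism can be summarized with a small Python model. It is a simplified sketch: it validates the application ID first and returns -1 on every error, whereas the decompiled code logs some of these errors at the 'D' level instead. All names are hypothetical.

```python
MAX_APPS = 8


class UhState:
    """Toy model of the command-related fields of uh_state."""

    def __init__(self):
        self.cmd_evtable = [None] * MAX_APPS
        self.cmd_checkers = [None] * MAX_APPS
        self.cmd_counts = [0] * MAX_APPS
        self.inited = False


def register_commands(state, app_id, handlers, checker=None):
    if state.inited or app_id >= MAX_APPS:
        raise ValueError("registration not permitted")
    state.cmd_evtable[app_id] = handlers
    state.cmd_checkers[app_id] = checker
    state.cmd_counts[app_id] = len(handlers)


def handle_command(state, app_id, cmd_id, regs):
    if app_id >= MAX_APPS or state.cmd_evtable[app_id] is None:
        return -1
    checker = state.cmd_checkers[app_id]
    # As in the decompiled code, command #0 bypasses the checker.
    if cmd_id and checker and checker(cmd_id):
        return -1
    if cmd_id >= state.cmd_counts[app_id]:
        return -1
    handler = state.cmd_evtable[app_id][cmd_id]
    if handler is None:
        return -1
    return handler(regs)


state = UhState()
# Hypothetical app 0 with three slots; slot 1 has no handler.
register_commands(state, 0, [lambda regs: 0, None, lambda regs: regs["x2"] + 1])
# Hypothetical app 1 whose checker rejects every command except #0.
register_commands(state, 1, [lambda regs: 0, lambda regs: 7], checker=lambda cmd: True)
```

Since "apps" are just entries in these arrays, nothing isolates one application's handlers from another's, which matches the earlier remark that the app concept is mostly an organizational one.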
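uh_init then calls uh_init_bigdata and uh_init_context.
uh_init_bigdata allocates and zeroes out the buffers used by the analytics feature. It also makes the bigdata region accessible as read and write in the EL2 page tables.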
int64_t uh_init_bigdata() {
// Allocate a buffer to store the analytics collected.
if (!bigdata_state) {
bigdata_state = malloc(0x230, 0);
}
// Reset this buffer and the bigdata global state.
memset(0x870ffc40, 0, 960);
memset(bigdata_state, 0, 560);
// Map this buffer into the hypervisor as writable.
return s1_map(0x870ff000, 0x1000, UNKN3 | WRITE | READ);
}
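uh_init_context allocates and zeroes out a buffer that is used to store the hypervisor registers on platform resets (we don't know where it is used, maybe by the monitor to restore the hypervisor state on some event).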
int64_t* uh_init_context() {
// ...
// Allocate a buffer to store the processor context.
uh_context = malloc(0x1000, 0);
if (!uh_context) {
uh_log('W', "RKP_1cae4f3b", 21, "%s RKP_148c665c", "uh_init_context");
}
// Reset this buffer.
return memset(uh_context, 0, 0x1000);
}
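uh_init calls memlist_init to initialize the dynamic_regions memlist in the uh_state structure, which will contain the memory regions that can be used by the dynamic allocator, and then calls the pa_restrict_init function.
pa_restrict_init initializes the protected_ranges memlist, which contains the critical hypervisor memory regions that should be protected, and adds the hypervisor memory region to it. It also checks that the rkp_cmd_counts and protected_ranges structures are contained in the memlist, as they should be.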
int64_t pa_restrict_init() {
// Initialize the memlist of protected ranges.
memlist_init(&protected_ranges);
// Add the uH memory region to it (containing the hypervisor code and data).
protected_ranges_add(0x87000000, 0x200000);
// Sanity-check: it must contain the `rkp_cmd_counts` array.
if (!protected_ranges_contains(&rkp_cmd_counts)) {
uh_log('D', "pa_restrict.c", 79, "Error, cmd_cnt not within protected range, cmd_cnt addr : %lx", rkp_cmd_counts);
}
// Sanity-check: it must also contain itself.
if (!protected_ranges_contains(&protected_ranges)) {
uh_log('D', "pa_restrict.c", 84, "Error protect_ranges not within protected range, protect_ranges addr : %lx",
&protected_ranges);
}
return uh_log('L', "pa_restrict.c", 87, "[+] uH PA Restrict Init");
}
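uh_init returns to main, which then calls vmm_init to initialize the virtual memory management system at EL1.
vmm_init sets the VBAR_EL2 register to the exception vector containing the hypervisor functions to be called to handle exceptions, and enables trapping of write accesses to the virtual memory control registers at EL1.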
int64_t vmm_init() {
// ...
uh_log('L', "vmm.c", 142, ">>vmm_init<<");
cs_init(&stru_870355E8);
cs_init(&panic_cs);
// Set the vector table of the hypervisor.
set_vbar_el2(&vmm_vector_table);
// HCR_EL2, Hypervisor Configuration Register.
//
// TVM, bit [26] = 1: EL1 write accesses to the specified EL1 virtual memory control registers are trapped to EL2.
hcr_el2 = get_hcr_el2() | 0x4000000;
uh_log('L', "vmm.c", 161, "RKP_398bc59b %x", hcr_el2);
set_hcr_el2(hcr_el2);
return 0;
}
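main then sets the VTTBR_EL2 register to the page tables that will be used for the second stage address translation at EL1. These are the page tables that translate a kernel IPA into an actual PA. Finally, before returning, main calls s2_enable.
s2_enable configures the second stage of address translation and enables it.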
void s2_enable() {
// ...
cs_init(&s2_lock);
// VTCR_EL2, Virtualization Translation Control Register.
//
// - T0SZ, bits [5:0] = 24: VTTBR_EL2 region size is 2^40.
// - SL0, bits [7:6] = 0b01: Stage 2 translation lookup start at level 1.
// - IRGN0, bits [9:8] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
// - ORGN0, bits [11:10] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
// - SH0, bits [13:12] = 0b11: Inner Shareable.
// - TG0, bits [15:14] = 0b00: Granule size is 4KB.
// - PS, bits [18:16] = 0b010: PA size is 40 bits, 1TB.
set_vtcr_el2(get_vtcr_el2() & 0xfff80000 | 0x23f58);
invalidate_entire_s1_s2_el1_tlb();
// HCR_EL2, Hypervisor Configuration Register.
//
// VM, bit [0] = 1: EL1&0 stage 2 address translation enabled.
set_hcr_el2(get_hcr_el2() | 1);
lock_start = 1;
}
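The constants used by vmm_init and s2_enable can be checked the same way as the stage 1 ones. The sketch below decodes the VTCR_EL2 value and builds the two HCR_EL2 bits; it is plain bit arithmetic on the values shown in the listings, and the helper name is ours.

```python
def field(value: int, hi: int, lo: int) -> int:
    """Extract bits [hi:lo] of a register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


VTCR_EL2_SET = 0x23F58   # value ORed into VTCR_EL2 by s2_enable
HCR_TVM = 1 << 26        # set by vmm_init: trap EL1 writes to VM control registers
HCR_VM = 1 << 0          # set by s2_enable: enable stage 2 translation

vtcr = {
    "T0SZ": field(VTCR_EL2_SET, 5, 0),
    "SL0": field(VTCR_EL2_SET, 7, 6),
    "TG0": field(VTCR_EL2_SET, 15, 14),
    "PS": field(VTCR_EL2_SET, 18, 16),
}

# HCR_EL2 bits added on top of whatever value was already there (0 here).
hcr_after_init = 0 | HCR_TVM | HCR_VM
```

The only difference with the stage 1 constant is the SL0 field (0b01), which makes the stage 2 walk start at level 1.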
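Application Initialization¶
We mentioned that uh_init calls command #0 for each registered application. Let's see what is being executed for the two applications in use: APP_INIT and APP_RKP.
APP_INIT¶
The command handlers registered for APP_INIT are:
Command ID | Command handler | Maximum number of calls |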
---|
0x00 | init_cmd_init | - |
0x02 | init_cmd_add_dynamic_region | - |
0x03 | init_cmd_id_0x03 | - |
0x04 | init_cmd_initialize_dynamic_heap | - |
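Let's take a look at command handler #0, which is called in uh_init. It is really simple: it sets the fault_handler field of uh_state. This structure contains the address of a kernel function that will be called when a fault is detected by the hypervisor.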
int64_t init_cmd_init(saved_regs_t* regs) {
// ...
// Ensure the fault handler can only be set once.
if (!uh_state.fault_handler && regs->x2) {
// Save the value provided into `uh_state`.
uh_state.fault_handler = rkp_get_pa(regs->x2);
uh_log('L', "main.c", 161, "[*] uH fault handler has been registered");
}
return 0;
}
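When uH calls this command, it does not do anything, as the registers, including x2, are all set to 0. But this command is also called later by the kernel, as can be seen in the rkp_init function from init/main.c.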
▸ init/main.c
static void __init rkp_init(void)
{
uh_call(UH_APP_INIT, 0, uh_get_fault_handler(), kimage_voffset, 0, 0);
// ...
}
Let's take a look at the fault handler registered by the kernel. It comes from the call to uh_get_fault_handler, which shows that it is actually the uh_fault_handler function.
▸ include/linux/uh_fault_handler.h
u64 uh_get_fault_handler(void)
{
uh_handler_list.uh_handler = (u64) & uh_fault_handler;
return (u64) & uh_handler_list;
}
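We can see in the definition of the uh_handler_list structure that the argument of the fault handler will be an instance of the uh_handler_data structure, which contains the values of some EL2 system registers as well as the general-purpose registers stored in the uh_registers structure.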
▸ include/linux/uh_fault_handler.h
typedef struct uh_registers {
u64 regs[31];
u64 sp;
u64 pc;
u64 pstate;
} uh_registers_t;
typedef struct uh_handler_data{
esr_t esr_el2;
u64 elr_el2;
u64 hcr_el2;
u64 far_el2;
u64 hpfar_el2;
uh_registers_t regs;
} uh_handler_data_t;
typedef struct uh_handler_list{
u64 uh_handler;
uh_handler_data_t uh_handler_data[NR_CPUS];
} uh_handler_list_t;
The uh_fault_handler function prints information about the fault before calling do_mem_abort and finally panic.
▸ init/uh_fault_handler.c
void uh_fault_handler(void)
{
unsigned int cpu;
uh_handler_data_t *uh_handler_data;
u32 exception_class;
unsigned long flags;
struct pt_regs regs;
spin_lock_irqsave(&uh_fault_lock, flags);
cpu = smp_processor_id();
uh_handler_data = &uh_handler_list.uh_handler_data[cpu];
exception_class = uh_handler_data->esr_el2.ec;
if (!exception_class_string[exception_class]
|| exception_class > esr_ec_brk_instruction_execution)
exception_class = esr_ec_unknown_reason;
pr_alert("=============uH fault handler logging=============\n");
pr_alert("%s",exception_class_string[exception_class]);
pr_alert("[System registers]\n", cpu);
pr_alert("ESR_EL2: %x\tHCR_EL2: %llx\tHPFAR_EL2: %llx\n",
uh_handler_data->esr_el2.bits,
uh_handler_data->hcr_el2, uh_handler_data->hpfar_el2);
pr_alert("FAR_EL2: %llx\tELR_EL2: %llx\n", uh_handler_data->far_el2,
uh_handler_data->elr_el2);
memset(&regs, 0, sizeof(regs));
memcpy(&regs, &uh_handler_data->regs, sizeof(uh_handler_data->regs));
do_mem_abort(uh_handler_data->far_el2, (u32)uh_handler_data->esr_el2.bits, &regs);
panic("%s",exception_class_string[exception_class]);
}
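Two of the other APP_INIT commands are used during the initialization of the hypervisor framework. They are not called by the kernel, but by S-Boot before the kernel is actually loaded and executed.
In dtb_update, S-Boot calls command #2 for each memory node in the Device Tree Blob (DTB). The arguments of this call are the memory region address and its size. It then calls command #4 with, as arguments, two pointers to local variables that will be filled by the hypervisor.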
int64_t dtb_update(...) {
// ...
dtb_find_entries(dtb, "memory", j_uh_add_dynamic_region);
sprintf(path, "/reserved-memory");
offset = dtb_get_path_offset(dtb, path);
if (offset < 0) {
dprintf("%s: fail to get path [%s]: %d\n", "dtb_update_reserved_memory", path, offset);
} else {
heap_base = 0;
heap_size = 0;
dtb_add_reserved_memory(dtb, offset, 0x87000000, 0x200000, "el2_code", "el2,uh");
uh_call(0xC300C000, 4, &heap_base, &heap_size, 0, 0);
dtb_add_reserved_memory(dtb, offset, heap_base, heap_size, "el2_earlymem", "el2,uh");
dtb_add_reserved_memory(dtb, offset, 0x80001000, 0x1000, "kaslr", "kernel-kaslr");
if (get_env_var(FORCE_UPLOAD) == 5)
rmem_size = 0x2400000;
else
rmem_size = 0x1700000;
dtb_add_reserved_memory(dtb, offset, 0xC9000000, rmem_size, "sboot", "sboot,rmem");
}
// ...
}
int64_t uh_add_dynamic_region(int64_t addr, int64_t size) {
uh_call(0xC300C000, 2, addr, size, 0, 0);
return 0;
}
void uh_call(...) {
asm("hvc #0");
}
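Command handler #2, which we named init_cmd_add_dynamic_region, is used to add a range of DDR memory to the dynamic_regions memlist, out of which the "dynamic heap" region of uH will be carved. S-Boot indicates to the hypervisor which physical memory regions it can access once the DDR has been initialized.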
int64_t init_cmd_add_dynamic_region(saved_regs_t* regs) {
// ...
// Ensure the dynamic heap allocator hasn't already been initialized.
if (uh_state.dynamic_heap_inited || !regs->x2 || !regs->x3) {
return -1;
}
// Add the given memory range to the dynamic regions memlist.
return memlist_add(&uh_state.dynamic_regions, regs->x2, regs->x3);
}
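Command handler #4, which we named init_cmd_initialize_dynamic_heap, is used to finalize the list of dynamic memory regions and initialize the dynamic heap allocator from it. S-Boot calls it once all the DDR memory has been added using the previous command. This function verifies its arguments, sets the start physical address of the kernel to the lowest DDR memory address, and finally calls initialize_dynamic_heap.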
int64_t init_cmd_initialize_dynamic_heap(saved_regs_t* regs) {
// ...
// Ensure the dynamic heap allocator hasn't already been initialized.
if (uh_state.dynamic_heap_inited || !regs->x2 || !regs->x3) {
return -1;
}
// Set the start of kernel physical memory to the lowest DDR address.
PHYS_OFFSET = memlist_get_min_addr(&uh_state.dynamic_regions);
// Ensure the S-Boot pointers are not in hypervisor memory.
base = check_and_convert_kernel_input(regs->x2);
size = check_and_convert_kernel_input(regs->x3);
if (!base || !size) {
uh_log('L', "main.c", 188, "Wrong addr in dynamicheap : base: %p, size: %p", base, size);
return -1;
}
// Initialize the dynamic heap allocator.
return initialize_dynamic_heap(base, size, regs->x4);
}
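initialize_dynamic_heap first computes the dynamic heap base address and size. If those values are provided by S-Boot, they are used directly. If the size is not provided, it is calculated automatically. If the base address is not provided, a region of DDR memory of the right size is carved out automatically. The function then calls dynamic_heap_initialize, which saves the selected range into global variables and initializes the list of heap chunks, similarly to the static heap allocator. It initializes three sparsemaps, physmap, ro_bitmap, and dbl_bitmap, which we will detail later. Finally, it initializes the robuf_regions memlist and the robuf sparsemap, and allocates a buffer to contain read-only pages to be used by the kernel.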
int64_t initialize_dynamic_heap(uint64_t* base, uint64_t* size, uint64_t flag) {
// Ensure the dynamic heap allocator hasn't already been initialized.
if (uh_state.dynamic_heap_inited) {
return -1;
}
// And mark it as initialized.
uh_state.dynamic_heap_inited = 1;
// The dynamic heap size can be provided by S-Boot, or calculated automatically.
if (flag) {
dynamic_heap_size = *size;
} else {
dynamic_heap_size = get_dynamic_heap_size();
}
// The dynamic heap base can be provided by S-Boot. In that case, the range provided is removed from the
// `dynamic_regions` memlist. Otherwise, a range of the requested size is automatically removed from the
// `dynamic_regions` memlist and is returned.
if (*base) {
dynamic_heap_base = *base;
if (memlist_remove(&uh_state.dynamic_regions, dynamic_heap_base, dynamic_heap_size)) {
uh_log('L', "main.c", 281, "[-] Dynamic heap address is not existed in memlist, base : %p", dynamic_heap_base);
return -1;
}
} else {
dynamic_heap_base = memlist_get_region_of_size(&uh_state.dynamic_regions, dynamic_heap_size, 0x200000);
}
// Actually initialize the dynamic heap allocator using the provided or computed base address and size.
dynamic_heap_initialize(dynamic_heap_base, dynamic_heap_size);
uh_log('L', "main.c", 288, "[+] Dynamic heap initialized base: %lx, size: %lx", dynamic_heap_base, dynamic_heap_size);
// Copy the dynamic heap base address and size back to S-Boot.
*base = dynamic_heap_base;
*size = dynamic_heap_size;
// Map the dynamic heap in the second stage at EL1 as writable.
mapped_start = dynamic_heap_base;
if (s2_map(dynamic_heap_base, dynamic_heap_size_0, UNKN1 | WRITE | READ, &mapped_start) < 0) {
uh_log('L', "main.c", 299, "s2_map returned false, start : %p, size : %p", mapped_start, dynamic_heap_size);
return -1;
}
// Create 3 new sparsemaps: `physmap`, `ro_bitmap` and `dbl_bitmap` mapping all the remaining DDR memory. The physmap
// internal entries are also added to the protected ranges as they are critical to the hypervisor security.
sparsemap_init("physmap", &uh_state.phys_map, &uh_state.dynamic_regions, 0x20, 0);
sparsemap_for_all_entries(&uh_state.phys_map, protected_ranges_add);
sparsemap_init("ro_bitmap", &uh_state.ro_bitmap, &uh_state.dynamic_regions, 1, 0);
sparsemap_init("dbl_bitmap", &uh_state.dbl_bitmap, &uh_state.dynamic_regions, 1, 0);
// Create a new memlist that will be used to allocate memory pages for page tables management. This memlist is
// initialized with all the remaining DDR memory.
memlist_init(&uh_state.page_allocator.list);
memlist_add(&uh_state.page_allocator.list, dynamic_heap_base, dynamic_heap_size);
// Create a new sparsemap mapping all the pages from the previous memlist.
sparsemap_init("robuf", &uh_state.page_allocator.map, &uh_state.page_allocator.list, 1, 0);
// Allocates a chunk of memory for the robuf allocator (RO pages for the kernel).
allocate_robuf();
// Unmap all the unused DDR memory that might remain below 0xa00000000.
regions_end_addr = memlist_get_max_addr(&uh_state.dynamic_regions);
if ((regions_end_addr >> 33) <= 4) {
s2_unmap(regions_end_addr, 0xa00000000 - regions_end_addr);
s1_unmap(regions_end_addr, 0xa00000000 - regions_end_addr);
}
return 0;
}
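If S-Boot does not provide a size, get_dynamic_heap_size is called. It first calculates and sets the robuf size: 1 MB per GB of DDR memory, plus 6 MB. It then calculates and returns the dynamic heap size: 4 MB per GB of DDR memory, plus 6 MB, rounded up to a multiple of 2 MB (0x200000).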
uint64_t get_dynamic_heap_size() {
// ...
// Do some housekeeping on the memlist.
memlist_merge_ranges(&uh_state.dynamic_regions);
memlist_dump(&uh_state.dynamic_regions);
// Calculate a first dynamic size, depending on the amount of DDR memory, to be added to a fixed robuf size.
some_size1 = memlist_get_contiguous_gigabytes(&uh_state.dynamic_regions, 0x100000);
set_robuf_size(some_size1 + 0x600000);
// Calculate a second and third dynamic sizes, to be added to the robuf size, to get the dynamic heap size.
some_size2 = memlist_get_contiguous_gigabytes(&uh_state.dynamic_regions, 0x100000);
some_size3 = memlist_get_contiguous_gigabytes(&uh_state.dynamic_regions, 0x200000);
dynamic_heap_size = some_size1 + 0x600000 + some_size2 + some_size3;
// Ceil the dynamic heap size to 0x200000 bytes.
return (dynamic_heap_size + 0x1fffff) & 0xffe00000;
}
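The arithmetic of get_dynamic_heap_size can be reproduced with a few lines of Python. This sketch assumes that memlist_get_contiguous_gigabytes(list, x) returns the number of gigabytes of DDR memory multiplied by x, which is our reading of the decompiled code rather than a confirmed behavior. The function names are ours.

```python
MB = 0x100000


def align_up_2mb(size: int) -> int:
    """Same rounding as the final statement of get_dynamic_heap_size."""
    return (size + 0x1FFFFF) & 0xFFE00000


def robuf_size(ddr_gigabytes: int) -> int:
    """1 MB per GB of DDR memory, plus a fixed 6 MB."""
    return ddr_gigabytes * 0x100000 + 0x600000


def dynamic_heap_size(ddr_gigabytes: int) -> int:
    """robuf size, plus 1 MB and 2 MB per GB, rounded up to a multiple of 2 MB."""
    size = robuf_size(ddr_gigabytes)
    size += ddr_gigabytes * 0x100000   # some_size2
    size += ddr_gigabytes * 0x200000   # some_size3
    return align_up_2mb(size)
```

Under this assumption, a device with 4 GB of DDR memory gets a 10 MB robuf and a 22 MB dynamic heap.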
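allocate_robuf tries to allocate a region of robuf_size bytes from the dynamic heap allocator that was initialized moments ago. If that is not possible, it grabs the last contiguous chunk of memory available in the allocator. It then calls page_allocator_init with this memory region as its argument. page_allocator_init initializes the sparsemap and everything else that the page allocator will use. The page allocator and the "robuf" region are what RKP will use to hand out read-only pages to the kernel (for the data protection feature, for example).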
int64_t allocate_robuf() {
// ...
// Ensure the dynamic heap allocator has been initialized.
if (!uh_state.dynamic_heap_inited) {
uh_log('L', "page_allocator.c", 84, "Dynamic heap needs to be initialized");
return -1;
}
// Align the robuf size down to a multiple of the page size.
robuf_size = uh_state.page_allocator.robuf_size & 0xfffff000;
// Allocate the robuf from the dynamic heap allocator.
robuf_base = dynamic_heap_alloc(uh_state.page_allocator.robuf_size & 0xfffff000, 0x1000);
// If the allocation failed, use the last memory chunk from the dynamic heap allocator.
if (!robuf_base) {
dynamic_heap_alloc_last_chunk(&robuf_base, &robuf_size);
}
if (!robuf_base) {
uh_log('L', "page_allocator.c", 96, "Robuffer Alloc Fail");
return -1;
}
// Clear the data cache for all robuf addresses.
if (robuf_size) {
offset = 0;
do {
zero_data_cache_page(robuf_base + offset);
offset += 0x1000;
} while (offset < robuf_size);
}
// Finally, initialize the page allocator using the robuf memory region.
return page_allocator_init(&uh_state.page_allocator, robuf_base, robuf_size);
}
APP_RKP¶
The command handlers registered for APP_RKP are:
Command ID | Command Handler | Maximum Calls |
---|
0x00 | rkp_cmd_init | 0 |
0x01 | rkp_cmd_start | 1 |
0x02 | rkp_cmd_deferred_start | 1 |
0x03 | rkp_cmd_write_pgt1 | - |
0x04 | rkp_cmd_write_pgt2 | - |
0x05 | rkp_cmd_write_pgt3 | - |
0x06 | rkp_cmd_emult_ttbr0 | - |
0x07 | rkp_cmd_emult_ttbr1 | - |
0x08 | rkp_cmd_emult_doresume | - |
0x09 | rkp_cmd_free_pgd | - |
0x0A | rkp_cmd_new_pgd | - |
0x0B | rkp_cmd_kaslr_mem | 0 |
0x0D | rkp_cmd_jopp_init | 1 |
0x0E | rkp_cmd_ropp_init | 1 |
0x0F | rkp_cmd_ropp_save | 0 |
0x10 | rkp_cmd_ropp_reload | - |
0x11 | rkp_cmd_rkp_robuffer_alloc | - |
0x12 | rkp_cmd_rkp_robuffer_free | - |
0x13 | rkp_cmd_get_ro_bitmap | 1 |
0x14 | rkp_cmd_get_dbl_bitmap | 1 |
0x15 | rkp_cmd_get_rkp_get_buffer_bitmap | 1 |
0x17 | rkp_cmd_id_0x17 | - |
0x18 | rkp_cmd_set_sctlr_el1 | - |
0x19 | rkp_cmd_set_tcr_el1 | - |
0x1A | rkp_cmd_set_contextidr_el1 | - |
0x1B | rkp_cmd_id_0x1B | - |
0x20 | rkp_cmd_dynamic_load | - |
0x40 | rkp_cmd_cred_init | 1 |
0x41 | rkp_cmd_assign_ns_size | 1 |
0x42 | rkp_cmd_assign_cred_size | 1 |
0x43 | rkp_cmd_pgd_assign | - |
0x44 | rkp_cmd_cred_set_fp | - |
0x45 | rkp_cmd_cred_set_security | - |
0x46 | rkp_cmd_assign_creds | - |
0x48 | rkp_cmd_ro_free_pages | - |
0x4A | rkp_cmd_prot_dble_map | - |
0x4B | rkp_cmd_mark_ppt | - |
0x4E | rkp_cmd_set_pages_ro_tsec_jar | - |
0x4F | rkp_cmd_set_pages_ro_vfsmnt_jar | - |
0x50 | rkp_cmd_set_pages_ro_cred_jar | - |
0x51 | rkp_cmd_id_0x51 | 1 |
0x52 | rkp_cmd_init_ns | - |
0x53 | rkp_cmd_ns_set_root_sb | - |
0x54 | rkp_cmd_ns_set_flags | - |
0x55 | rkp_cmd_ns_set_data | - |
0x56 | rkp_cmd_ns_set_sys_vfsmnt | 5 |
0x57 | rkp_cmd_id_0x57 | - |
0x60 | rkp_cmd_selinux_initialized | - |
0x81 | rkp_cmd_test_get_par | 0 |
0x82 | rkp_cmd_test_get_wxn | 0 |
0x83 | rkp_cmd_test_ro_range | 0 |
0x84 | rkp_cmd_test_get_va_xn | 0 |
0x85 | rkp_check_vmm_unmapped | 0 |
0x86 | rkp_cmd_test_ro | 0 |
0x87 | rkp_cmd_id_0x87 | 0 |
0x88 | rkp_cmd_check_splintering_point | 0 |
0x89 | rkp_cmd_id_0x89 | 0 |
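Let's take a look at command handler #0, which is called in uh_init. It simply initializes the maximum number of times each command can be called (a limit enforced by the "checker" function) by calling rkp_init_cmd_counts.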
int64_t rkp_cmd_init() {
// Enable panic when a violation is detected.
rkp_panic_on_violation = 1;
// Initialize the counters of commands executions.
rkp_init_cmd_counts();
cs_init(&rkp_start_lock);
return 0;
}
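To make the idea of per-command call limits more concrete, here is a minimal sketch of how such a counter table could be enforced. It is not the hypervisor's code: the names sketch_cmd_counts, sketch_init_cmd_counts and sketch_cmd_allowed are hypothetical, as is the encoding of "-" (unlimited) as -1, and only a handful of the limits from the table above are loaded.

```c
#include <stdint.h>

#define SKETCH_CMD_MAX 0x90    /* hypothetical: one slot per command ID */
#define SKETCH_UNLIMITED (-1)  /* hypothetical encoding of "-" in the table */

/* Remaining number of calls for each command ID (assumed layout). */
static int8_t sketch_cmd_counts[SKETCH_CMD_MAX];

/* Load a few of the limits listed in the table; all other slots stay at 0. */
void sketch_init_cmd_counts(void) {
  for (int i = 0; i < SKETCH_CMD_MAX; i++) {
    sketch_cmd_counts[i] = 0;
  }
  sketch_cmd_counts[0x01] = 1;                /* rkp_cmd_start */
  sketch_cmd_counts[0x02] = 1;                /* rkp_cmd_deferred_start */
  sketch_cmd_counts[0x03] = SKETCH_UNLIMITED; /* rkp_cmd_write_pgt1 */
  sketch_cmd_counts[0x56] = 5;                /* rkp_cmd_ns_set_sys_vfsmnt */
  sketch_cmd_counts[0x81] = 0;                /* rkp_cmd_test_get_par */
}

/* Return 1 and consume one call if the command may still run, 0 otherwise. */
int sketch_cmd_allowed(uint32_t cmd_id) {
  if (cmd_id >= SKETCH_CMD_MAX) {
    return 0;
  }
  if (sketch_cmd_counts[cmd_id] == SKETCH_UNLIMITED) {
    return 1;
  }
  if (sketch_cmd_counts[cmd_id] == 0) {
    return 0;
  }
  sketch_cmd_counts[cmd_id]--;
  return 1;
}
```

With this model, a command limited to one call succeeds once and is refused afterwards, while a command with a limit of 0 can never be invoked through the checker.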
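Exception Handling¶
An important part of a hypervisor is its exception handling code. These functions are called on various events: faulting memory accesses by the kernel, execution of an HVC instruction by the kernel, etc. They can be found by looking at the vector table specified in the VBAR_EL2 register. We have seen in vmm_init that the vector table is located at vmm_vector_table. From the ARMv8 specifications, we know it has the following structure: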
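Address | Exception Type | Description |
---|
+0x000 | Synchronous | Current EL with SP0 |
+0x080 | IRQ/vIRQ | Current EL with SP0 |
+0x100 | FIQ/vFIQ | Current EL with SP0 |
+0x180 | SError/vSError | Current EL with SP0 |
+0x200 | Synchronous | Current EL with SPx |
+0x280 | IRQ/vIRQ | Current EL with SPx |
+0x300 | FIQ/vFIQ | Current EL with SPx |
+0x380 | SError/vSError | Current EL with SPx |
+0x400 | Synchronous | Lower EL using AArch64 |
+0x480 | IRQ/vIRQ | Lower EL using AArch64 |
+0x500 | FIQ/vFIQ | Lower EL using AArch64 |
+0x580 | SError/vSError | Lower EL using AArch64 |
+0x600 | Synchronous | Lower EL using AArch32 |
+0x680 | IRQ/vIRQ | Lower EL using AArch32 |
+0x700 | FIQ/vFIQ | Lower EL using AArch32 |
+0x780 | SError/vSError | Lower EL using AArch32 |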
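Our device has a 64-bit kernel executing at EL1, so hypervisor calls should be dispatched to the exception handler located at vmm_vector_table+0x400. But in the hypervisor, all the exception handlers end up calling the vmm_dispatch function with different arguments.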
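The vector table offsets are simply the sum of an exception "level" (which quarter of the table) and an exception "type" (which entry in that quarter), and these are the two values that vmm_dispatch receives. The sketch below illustrates that arithmetic; the enum and function names are hypothetical and only the numeric values come from the table.

```c
#include <stdint.h>

/* Quarter of the vector table, i.e. where the exception was taken from. */
enum sketch_exc_level {
  SKETCH_CUR_EL_SP0 = 0x000,
  SKETCH_CUR_EL_SPX = 0x200,
  SKETCH_LOWER_EL_A64 = 0x400,
  SKETCH_LOWER_EL_A32 = 0x600,
};

/* Entry within a quarter, i.e. the kind of exception. */
enum sketch_exc_type {
  SKETCH_SYNC = 0x000,
  SKETCH_IRQ = 0x080,
  SKETCH_FIQ = 0x100,
  SKETCH_SERROR = 0x180,
};

/* Offset of a handler from the base address held in VBAR_EL2. */
uint32_t sketch_vector_offset(uint32_t level, uint32_t type) {
  return level + type;
}

/* Recover the level and the type from a vector table offset. */
uint32_t sketch_offset_level(uint32_t offset) {
  return offset & 0x600;
}

uint32_t sketch_offset_type(uint32_t offset) {
  return offset & 0x180;
}
```

For example, an HVC issued by a 64-bit kernel is a synchronous exception from a lower EL using AArch64, which gives 0x400 + 0x000 = 0x400.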
void exception_handler(...) {
// ...
// Save registers x0 to x30, sp_el1, elr_el2, spsr_el2.
// ...
// Dispatch the exception to the VMM, passing it the exception level and type.
vmm_dispatch(<exc_level>, <exc_type>, &regs);
// Clear the local monitor and return to the caller.
asm("clrex");
asm("eret");
}
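The level and type of the exception that has been taken are passed to vmm_dispatch as arguments. For synchronous exceptions, it calls vmm_synchronous_handler and panics if that function returns a non-zero value. For all other exception types, it simply logs a message.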
int64_t vmm_dispatch(int64_t level, int64_t type, saved_regs_t* regs) {
// ...
// If another core has called `vmm_panic`, panic on this core too.
if (has_panicked) {
vmm_panic(level, type, regs, "panic on another core");
}
// Handle the exception depending on its type.
switch (type) {
case 0x0: /* Synchronous */
// For synchronous exception, call the appropriate handler and panic if handling failed.
if (vmm_synchronous_handler(level, type, regs)) {
vmm_panic(level, type, regs, "syncronous handler failed");
}
break;
case 0x80: /* IRQ/vIRQ */
uh_log('D', "vmm.c", 1132, "RKP_e3b85960");
break;
case 0x100: /* FIQ/vFIQ */
uh_log('D', "vmm.c", 1135, "RKP_6d732e0a");
break;
case 0x180: /* SError/vSError */
uh_log('D', "vmm.c", 1149, "RKP_3c71de0a");
break;
default:
return 0;
}
return 0;
}
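vmm_synchronous_handler first gets the exception class by reading the ESR_EL2 register:
for HVC instruction executions, it calls uh_handle_command to dispatch the call to the appropriate application command handler;
for trapped system register accesses, it calls other_msr_mrs_system to decide whether the write is allowed, then resumes execution or panics depending on the function's return value;
for instruction aborts from the kernel, specific aborts and aborts with a faulting address of zero are skipped, and all other aborts result in a panic;
for data aborts from the kernel, it first checks whether this is a write to a kernel page table, and if so, it calls the rkp_lxpgt_write function corresponding to the target page table level. For level 3 translation faults, the fault is skipped if the address can be translated successfully (using AT S12E1R or AT S12E1W). For permission faults, the fault is skipped and the TLBs are flushed if the address can be translated successfully (using AT S12E1W). Aborts with a faulting address of zero are skipped, and all other aborts result in a panic.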
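The handler's decisions all derive from a few bit fields of ESR_EL2. As a rough illustration, the sketch below extracts those fields with the same masks and shifts as the decompiled handler; the function names are hypothetical, and the value is treated as a 32-bit quantity for simplicity.

```c
#include <stdint.h>

/* EC, bits [31:26]: exception class (0x16 = HVC from AArch64, 0x24 = data abort from a lower EL, ...). */
uint32_t sketch_esr_ec(uint32_t esr) {
  return (esr >> 26) & 0x3f;
}

/* ISS, bits [24:0]: instruction specific syndrome. */
uint32_t sketch_esr_iss(uint32_t esr) {
  return esr & 0x1ffffff;
}

/* DFSC/IFSC, bits [5:0] = 0b000111: translation fault, level 3. */
int sketch_is_l3_translation_fault(uint32_t esr) {
  return (esr & 0x3f) == 0x07;
}

/* WnR, bit [6] = 1 and DFSC, bits [5:0] = 0b0011xx: write permission fault at any level. */
int sketch_is_write_permission_fault(uint32_t esr) {
  return (esr & 0x7c) == 0x4c;
}
```

For instance, 0x5a000000 decodes to the exception class 0x16 (HVC from AArch64), and 0x9000004f to a data abort from a lower EL (0x24) caused by a write permission fault.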
int64_t vmm_synchronous_handler(int64_t level, int64_t type, saved_regs_t* regs) {
// ...
// ESR_EL2, Exception Syndrome Register (EL2).
//
// EC, bits [31:26]: Indicates the reason for the exception that this register holds information about.
esr_el2 = get_esr_el2();
switch (esr_el2 >> 26) {
case 0x12: /* HVC instruction execution in AArch32 state */
case 0x16: /* HVC instruction execution in AArch64 state */
// For HVC instruction execution, check if the HVC ID starts with 0xc300cXXX.
if ((regs->x0 & 0xfffff000) == 0xc300c000) {
app_id = regs->x0;
cmd_id = regs->x1;
// Reset the injection value for the current CPU.
cpu_num = get_current_cpu();
if (cpu_num <= 7) {
uh_state.injections[cpu_num] = 0;
}
// Dispatch the call to the application command handler.
uh_handle_command(app_id, cmd_id, regs);
}
return 0;
case 0x18: /* Trapped MSR, MRS or Sys. ins. execution in AArch64 state */
// For trapped system register accesses, first ensure that it is a write. If that's the case, call a handler to
// decide whether the operation is allowed or not.
//
// The handler gets the value that was being written to the system register from the saved general registers.
// Depending on which system register is being written, it will check if specific bits have a fixed value. If the
// write operation is allowed, ELR_EL2 is updated to make it point to the next instruction. If the operation is
// denied, the hypervisor will panic.
//
// - Direction, bit [0] = 0: Write access, including MSR instructions.
// - Op0/Op2/Op1/CRn/Rt/CRm, bits[21:1]: Values from the issued instruction.
if ((esr_el2 & 1) == 0 && !other_msr_mrs_system(&regs->x0, esr_el2 & 0x1ffffff)) {
return 0;
}
vmm_panic(level, type, regs, "other_msr_mrs_system failure");
return 0;
case 0x20: /* Instruction Abort from a lower EL */
// ...
// For instruction aborts coming from a lower EL, if the bits patterns below all match and the number of
// instruction aborts skipped is less than 9, then the number is incremented and the abort is skipped.
//
// - IFSC, bits [5:0] = 0b000111: Translation fault, level 3.
// - S1PTW, bit [7] = 0b1: Fault on the stage 2 translation of an access for a stage 1 translation table walk.
// - EA, bit [9] = 0b0: Not an External Abort.
// - FnV, bit [10] = 0b0: FAR is valid.
// - SET, bits [12:11] = 0b00: Recoverable state (UER).
if (should_skip_prefetch_abort() == 1) {
return 0;
}
// If the faulting address is 0, the fault is injected back to be handled by EL1 and the injection value is set
// for the current CPU. Otherwise, the hypervisor panics.
if (!esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-")) {
print_vmm_registers(regs);
return 0;
}
vmm_panic(level, type, regs, "esr_ec_prefetch_abort_from_a_lower_exception_level");
return 0;
case 0x21: /* Instruction Abort taken without a change in EL */
// For instruction aborts taken without a change in EL, meaning hypervisor faults, it panics.
uh_log('L', "vmm.c", 920, "esr abort iss: 0x%x", esr_el2 & 0x1ffffff);
vmm_panic(level, type, regs, "esr_ec_prefetch_abort_taken_without_a_change_in_exception_level");
case 0x24: /* Data Abort from a lower EL */
// For data aborts coming from a lower EL, it first calls `rkp_fault` to try to detect page table writes. That is
// when the faulting instruction is in the kernel text and is a `str x2, [x1]`. In addition, the x1 register must
// point to a page table entry. Then, depending on the page table level, it calls a different function:
//
// - rkp_l1pgt_write for level 1 PTs.
// - rkp_l2pgt_write for level 2 PTs.
// - rkp_l3pgt_write for level 3 PTs.
//
// If the kernel page table write is allowed, the PC is advanced to the next instruction.
if (!rkp_fault(regs)) {
return 0;
}
// For translation faults at level 3, convert the faulting IPA into a kernel VA. Then call the `el1_va_to_pa`
// function that will use the AT S12E1R/W instruction to translate it to a PA, as if the access was coming from
// EL1. If the address can be translated successfully, we return immediately.
//
// DFSC, bits [5:0] = 0b000111: Translation fault, level 3.
if ((esr_el2 & 0x3f) == 0b000111) {
// HPFAR_EL2, Hypervisor IPA Fault Address Register.
//
// Holds the faulting IPA for some aborts on a stage 2 translation taken to EL2.
va = rkp_get_va(get_hpfar_el2() << 8);
cs_enter(&s2_lock);
// el1_va_to_pa returns 0 if the address can be translated.
res = el1_va_to_pa(va, &ipa);
if (!res) {
uh_log('L', "vmm.c", 994, "Skipped data abort va: %p, ipa: %p", va, ipa);
cs_exit(&s2_lock);
return 0;
}
cs_exit(&s2_lock);
}
// For permission faults at any level, convert the faulting IPA into a kernel VA. Then use the AT S12E1W
// instruction to translate it to a PA, as if the access was coming from EL1. If the address can be translated
// successfully, invalidate the TLBs and return immediately.
//
// - WnR, bit [6] = 0b1: Abort caused by an instruction writing to a memory location.
// - DFSC, bits [5:0] = 0b0011xx: Permission fault, any level.
if ((esr_el2 & 0x7c) == 0x4c) {
va = rkp_get_va(get_hpfar_el2() << 8);
at_s12e1w(va);
// PAR_EL1, Physical Address Register.
//
// F, bit [0] = 0: Successful address translation.
if ((get_par_el1() & 1) == 0) {
print_el2_state();
invalidate_entire_s1_s2_el1_tlb();
return 0;
}
}
// ...
// For all other aborts, call the same function as the other instruction aborts.
if (esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-")) {
vmm_panic(level, type, regs, "esr_ec_data_abort_from_a_lower_exception_level");
} else {
print_vmm_registers(regs);
}
return 0;
case 0x25: /* Data Abort taken without a change in EL */
// For data aborts taken without a change in EL, meaning hypervisor faults, it panics.
vmm_panic(level, type, regs, "esr_ec_data_abort_taken_without_a_change_in_exception_level");
return 0;
default:
return -1;
}
}
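The vmm_panic function, called when the hypervisor needs to panic, first logs the panic message, the exception level, and the exception type. If the MMU is disabled, or if the exception is not synchronous, or if it was taken from EL2, it calls uh_panic. Otherwise, it calls uh_panic_el1.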
crit_sec_t* vmm_panic(int64_t level, int64_t type, saved_regs_t* regs, char* message) {
// ...
uh_log('L', "vmm.c", 1171, ">>vmm_panic<<");
cs_enter(&panic_cs);
// Print the panic message.
uh_log('L', "vmm.c", 1175, "message: %s", message);
// Print the exception level.
switch (level) {
case 0x0:
uh_log('L', "vmm.c", 1179, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_EL0");
break;
case 0x200:
uh_log('L', "vmm.c", 1182, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_ELX");
break;
case 0x400:
uh_log('L', "vmm.c", 1185, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH64");
break;
case 0x600:
uh_log('L', "vmm.c", 1188, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH32");
break;
default:
uh_log('L', "vmm.c", 1191, "level: VMM_UNKNOWN\n");
break;
}
// Print the exception type.
switch (type) {
case 0x0:
uh_log('L', "vmm.c", 1197, "type: VMM_EXCEPTION_TYPE_SYNCHRONOUS");
break;
case 0x80:
uh_log('L', "vmm.c", 1200, "type: VMM_EXCEPTION_TYPE_IRQ_OR_VIRQ");
break;
case 0x100:
uh_log('L', "vmm.c", 1203, "type: VMM_SYSCALL\n");
break;
case 0x180:
uh_log('L', "vmm.c", 1206, "type: VMM_EXCEPTION_TYPE_SERROR_OR_VSERROR");
break;
default:
uh_log('L', "vmm.c", 1209, "type: VMM_UNKNOWN\n");
break;
}
print_vmm_registers(regs);
// SCTLR_EL1, System Control Register (EL1).
//
// M, bit [0] = 0b0: EL1&0 stage 1 address translation disabled.
if ((get_sctlr_el1() & 1) == 0 || type != 0 /* Synchronous */ ||
(level == 0 /* Current EL with SP0 */ || level == 0x200 /* Current EL with SPx */)) {
has_panicked = 1;
cs_exit(&panic_cs);
// Reset the device immediately if the panic originated from another core.
if (!strcmp(message, "panic on another core")) {
exynos_reset(0x8800);
}
// Call `uh_panic` which will ultimately reset the device.
uh_panic();
}
// Call `uh_panic_el1` which will execute the registered kernel fault handler.
uh_panic_el1(uh_state.fault_handler, regs);
return cs_exit(&panic_cs);
}
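uh_panic calls print_state_and_reset, which logs the EL1 and EL2 system register values, as well as the contents of the hypervisor and kernel stacks. It copies a textual version of this information into the "bigdata" region, and then reboots the device.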
void uh_panic() {
uh_log('L', "main.c", 482, "uh panic!");
print_state_and_reset();
}
void print_state_and_reset() {
// Print debug values.
uh_log('L', "panic.c", 29, "count state - page_ro: %lx, page_free: %lx, s2_breakdown: %lx", page_ro, page_free,
s2_breakdown);
// Print EL2 system registers values.
print_el2_state();
// Print EL1 system registers values.
print_el1_state();
// Print the content of the hypervisor and kernel stacks.
print_stack_contents();
// Store this information for the analytics system.
bigdata_store_data();
// Reset the device.
has_panicked = 1;
exynos_reset(0x8800);
}
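uh_panic_el1 fills the uh_handler_data structure that we have seen earlier with the system and general register values. It then sets ELR_EL2 to the kernel fault handler so that it is called when the ERET instruction is executed.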
int64_t uh_panic_el1(uh_handler_list_t* fault_handler, saved_regs_t* regs) {
// ...
// Ensure that a kernel fault handler is registered.
uh_log('L', "vmm.c", 111, ">>uh_panic_el1<<");
if (!fault_handler) {
uh_log('L', "vmm.c", 113, "uH handler did not registered");
uh_panic();
}
// Print EL2 system registers values.
print_el2_state();
// Print EL1 system registers values.
print_el1_state();
// Print the content of the hypervisor and kernel stacks.
print_stack_contents();
// Set the injection value for the current CPU, unless it has already been set, in which case it panics.
cpu_num = get_current_cpu();
if (cpu_num <= 7) {
something = cpu_num - 0x21530000;
if (uh_state.injections[cpu_num] == something) {
uh_log('D', "vmm.c", 99, "Injection locked");
}
uh_state.injections[cpu_num] = something;
}
// Fill the `uh_handler_data` structure with the registers values.
handler_data = &fault_handler->uh_handler_data[cpu_num];
handler_data->esr_el2 = get_esr_el2();
handler_data->elr_el2 = get_elr_el2();
handler_data->hcr_el2 = get_hcr_el2();
handler_data->far_el2 = get_far_el2();
handler_data->hpfar_el2 = get_hpfar_el2() << 8;
if (regs) {
memcpy(fault_handler->uh_handler_data[cpu_num].regs.regs, regs, 272);
}
// Finally, set ELR_EL2 to the kernel fault handler to execute it on exception return.
set_elr_el2(fault_handler->uh_handler);
return 0;
}
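Now that we have seen how the hypervisor is initialized and how exceptions are handled, let's see how the RKP-specific parts are started.
Startup¶
RKP startup is performed in two stages, using two different commands, RKP_START and RKP_DEFERRED_START.
RKP Start¶
On the kernel side, the first startup-related command is called in rkp_init.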
▸ init/main.c
static void __init rkp_init(void)
{
// ...
rkp_init_data.vmalloc_end = (u64)high_memory;
rkp_init_data.init_mm_pgd = (u64)__pa(swapper_pg_dir);
rkp_init_data.id_map_pgd = (u64)__pa(idmap_pg_dir);
rkp_init_data.tramp_pgd = (u64)__pa(tramp_pg_dir);
#ifdef CONFIG_UH_RKP_FIMC_CHECK
rkp_init_data.no_fimc_verify = 1;
#endif
rkp_init_data.tramp_valias = (u64)TRAMP_VALIAS;
rkp_init_data.zero_pg_addr = (u64)__pa(empty_zero_page);
// ...
uh_call(UH_APP_RKP, RKP_START, (u64)&rkp_init_data, (u64)kimage_voffset, 0, 0);
}
This function fills a data structure of type rkp_init_t that is given to the hypervisor. It contains information about the kernel memory layout.
▸ init/main.c
rkp_init_t rkp_init_data __rkp_ro = {
.magic = RKP_INIT_MAGIC,
.vmalloc_start = VMALLOC_START,
.no_fimc_verify = 0,
.fimc_phys_addr = 0,
._text = (u64)_text,
._etext = (u64)_etext,
._srodata = (u64)__start_rodata,
._erodata = (u64)__end_rodata,
.large_memory = 0,
};
The rkp_init function is called in start_kernel, early in the kernel boot process.
▸ init/main.c
asmlinkage __visible void __init start_kernel(void)
{
// ...
rkp_init();
// ...
}
On the hypervisor side, the command handler simply ensures that it cannot be called twice, and calls rkp_start after acquiring the appropriate lock.
int64_t rkp_cmd_start(saved_regs_t* regs) {
// ...
cs_enter(&rkp_start_lock);
// Make sure RKP is not already started.
if (rkp_inited) {
cs_exit(&rkp_start_lock);
uh_log('L', "rkp.c", 133, "RKP is already started");
return -1;
}
// Call the actual startup function.
res = rkp_start(regs);
cs_exit(&rkp_start_lock);
return res;
}
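The rkp_start function saves all the information about the kernel memory layout into global variables. It initializes two memlists: executable_regions, which contains all the kernel executable regions (including the kernel text), and dynamic_load_regions, which is used for the "dynamic executable loading" feature that is not detailed in this blog post. It also protects the kernel sections by calling the rkp_paging_init function, and processes the user page tables by calling rkp_l1pgt_process_table.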
int64_t rkp_start(saved_regs_t* regs) {
// ...
// Save the offset between the kernel virtual and physical mappings into `KIMAGE_VOFFSET`.
KIMAGE_VOFFSET = regs->x3;
// Convert the address of the `rkp_init_data` structure from a VA to a PA using `rkp_get_pa`.
rkp_init_data = rkp_get_pa(regs->x2);
// Check the magic value.
if (rkp_init_data->magic - 0x5afe0001 >= 2) {
uh_log('L', "rkp_init.c", 85, "RKP INIT-Bad Magic(%d), %p", regs->x2, rkp_init_data);
return -1;
}
// If it is the test magic value, call `rkp_init_cmd_counts_test` which allows test commands 0x81-0x88 to be called an
// unlimited number of times.
if (rkp_init_data->magic == 0x5afe0002) {
rkp_init_cmd_counts_test();
rkp_test = 1;
}
// Saves the various fields of `rkp_init_data` into global variables.
INIT_MM_PGD = rkp_init_data->init_mm_pgd;
ID_MAP_PGD = rkp_init_data->id_map_pgd;
ZERO_PG_ADDR = rkp_init_data->zero_pg_addr;
TRAMP_PGD = rkp_init_data->tramp_pgd;
TRAMP_VALIAS = rkp_init_data->tramp_valias;
VMALLOC_START = rkp_init_data->vmalloc_start;
VMALLOC_END = rkp_init_data->vmalloc_end;
TEXT = rkp_init_data->_text;
ETEXT = rkp_init_data->_etext;
TEXT_PA = rkp_get_pa(TEXT);
ETEXT_PA = rkp_get_pa(ETEXT);
SRODATA = rkp_init_data->_srodata;
ERODATA = rkp_init_data->_erodata;
TRAMP_PGD_PAGE = TRAMP_PGD & 0xfffffffff000;
INIT_MM_PGD_PAGE = INIT_MM_PGD & 0xfffffffff000;
LARGE_MEMORY = rkp_init_data->large_memory;
page_ro = 0;
page_free = 0;
s2_breakdown = 0;
pmd_allocated_by_rkp = 0;
NO_FIMC_VERIFY = rkp_init_data->no_fimc_verify;
if (rkp_bitmap_init() < 0) {
uh_log('L', "rkp_init.c", 150, "Failed to init bitmap");
return -1;
}
// Create a new memlist to contain the list of kernel executable regions.
memlist_init(&executable_regions);
memlist_set_unkn_14(&executable_regions);
// Add the kernel text to the newly created memlist.
memlist_add(&executable_regions, TEXT, ETEXT - TEXT);
// Add the `TRAMP_VALIAS` page to the newly created memlist.
if (TRAMP_VALIAS) {
memlist_add(&executable_regions, TRAMP_VALIAS, 0x1000);
}
// Create a new memlist of dynamically loaded executable regions.
memlist_init(&dynamic_load_regions);
memlist_set_unkn_14(&dynamic_load_regions);
// Call a function that makes the static heap acquire all the unused dynamic memory.
put_all_dynamic_heap_chunks_in_static_heap();
// Map and protect various kernel regions in the second stage at EL1, and at EL2.
if (rkp_paging_init() < 0) {
uh_log('L', "rkp_init.c", 169, "rkp_pging_init fails");
return -1;
}
// Mark RKP as initialized.
rkp_inited = 1;
// Call a function that will process the user page tables.
if (rkp_l1pgt_process_table(get_ttbr0_el1() & 0xfffffffff000, 0, 1) < 0) {
uh_log('L', "rkp_init.c", 179, "processing l1pgt fails");
return -1;
}
// Log EL2 system registers values.
uh_log('L', "rkp_init.c", 183, "[*] HCR_EL2: %lx, SCTLR_EL2: %lx", get_hcr_el2(), get_sctlr_el2());
uh_log('L', "rkp_init.c", 184, "[*] VTTBR_EL2: %lx, TTBR0_EL2: %lx", get_vttbr_el2(), get_ttbr0_el2());
uh_log('L', "rkp_init.c", 185, "[*] MAIR_EL1: %lx, MAIR_EL2: %lx", get_mair_el1(), get_mair_el2());
uh_log('L', "rkp_init.c", 186, "RKP Activated");
return 0;
}
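The magic check in rkp_start uses a common idiom: subtracting the lower bound and doing a single unsigned comparison accepts exactly the values 0x5afe0001 and 0x5afe0002, because any smaller value wraps around to a very large number. Here is a small sketch of that idiom; the macro and function names are hypothetical, and we assume the magic field is a 32-bit unsigned integer.

```c
#include <stdint.h>

#define SKETCH_MAGIC_NORMAL 0x5afe0001u /* hypothetical name for the regular magic */
#define SKETCH_MAGIC_TEST 0x5afe0002u   /* hypothetical name for the test magic */

/* Accept only the two magic values, using one unsigned range check. */
int sketch_magic_is_valid(uint32_t magic) {
  return (uint32_t)(magic - SKETCH_MAGIC_NORMAL) < 2;
}

/* The test magic is the one that unlocks the test commands. */
int sketch_magic_is_test(uint32_t magic) {
  return magic == SKETCH_MAGIC_TEST;
}
```

A value of 0x5afe0000 yields 0xffffffff after the subtraction, so it is rejected just like 0x5afe0003.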
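The rkp_paging_init function marks the kernel text region as TEXT in the phys_map and makes it read-only from the hypervisor. swapper_pg_dir is made writable from the hypervisor, and empty_zero_page is made read-only executable from the kernel. The kernel text is made executable, and the log region and the dynamic heap region are made read-only from the kernel.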
int64_t rkp_paging_init() {
// ...
// Ensure the start of the kernel text is page-aligned.
if (!TEXT || (TEXT & 0xfff) != 0) {
uh_log('L', "rkp_paging.c", 637, "kernel text start is not aligned, stext : %p", TEXT);
return -1;
}
// Ensure the end of the kernel text is page-aligned.
if (!ETEXT || (ETEXT & 0xfff) != 0) {
uh_log('L', "rkp_paging.c", 642, "kernel text end is not aligned, etext : %p", ETEXT);
return -1;
}
// Ensure the kernel text section doesn't contain the base address.
if (TEXT_PA <= get_base() && ETEXT_PA > get_base()) {
return -1;
}
// Unmap the hypervisor memory from the second stage (to make it inaccessible to the kernel).
if (s2_unmap(0x87000000, 0x200000)) {
return -1;
}
// Set the kernel text section as `TEXT` in the physmap.
if (rkp_phys_map_set_region(TEXT_PA, ETEXT - TEXT, TEXT) < 0) {
uh_log('L', "rkp_paging.c", 435, "physmap set failed for kernel text");
return -1;
}
// Set the kernel text section as read-only from the hypervisor.
if (s1_map(TEXT_PA, ETEXT - TEXT, UNKN1 | READ)) {
uh_log('L', "rkp_paging.c", 447, "Failed to make VMM S1 range RO");
return -1;
}
// Ensure the `swapper_pg_dir` is not contained within the kernel text section.
if (INIT_MM_PGD >= TEXT_PA && INIT_MM_PGD < ETEXT_PA) {
uh_log('L', "rkp_paging.c", 454, "failed to make swapper_pg_dir RW");
return -1;
}
// Set the `swapper_pg_dir` as writable from the hypervisor.
if (s1_map(INIT_MM_PGD, 0x1000, UNKN1 | WRITE | READ)) {
uh_log('L', "rkp_paging.c", 454, "failed to make swapper_pg_dir RW");
return -1;
}
rkp_phys_map_lock(ZERO_PG_ADDR);
// Set the `empty_zero_page` as read-only executable in the second stage.
if (rkp_s2_page_change_permission(ZERO_PG_ADDR, 0 /* read-write */, 1 /* executable */, 1) < 0) {
uh_log('L', "rkp_paging.c", 462, "Failed to make executable for empty_zero_page");
return -1;
}
rkp_phys_map_unlock(ZERO_PG_ADDR);
// Make the kernel text section executable for the kernel (note the 0 given as argument).
if (rkp_set_kernel_rox(0 /* read-write */)) {
return -1;
}
// Set the log region read-only in the second stage.
if (rkp_s2_range_change_permission(0x87100000, 0x87140000, 0x80 /* read-only */, 1 /* executable */, 1) < 0) {
uh_log('L', "rkp_paging.c", 667, "Failed to make UH_LOG region RO");
return -1;
}
// Ensure the dynamic heap has been initialized.
if (!uh_state.dynamic_heap_inited) {
return 0;
}
// Set the dynamic heap region as read-only in the second stage.
if (rkp_s2_range_change_permission(uh_state.dynamic_heap_base,
uh_state.dynamic_heap_base + uh_state.dynamic_heap_size, 0x80 /* read-only */,
1 /* executable */, 1) < 0) {
uh_log('L', "rkp_paging.c", 685, "Failed to make dynamic_heap region RO");
return -1;
}
return 0;
}
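The rkp_set_kernel_rox function makes the kernel text and rodata sections executable in the second stage and, depending on the access argument, either writable or read-only. The first time the function is called, the argument is 0, but it is called again later with 0x80. It also updates the ro_bitmap to mark the pages of the kernel rodata section as read-only (which is different from the actual page tables).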
int64_t rkp_set_kernel_rox(int64_t access) {
// ...
// Set the kernel text and rodata sections as executable.
erodata_pa = rkp_get_pa(ERODATA);
if (rkp_s2_range_change_permission(TEXT_PA, erodata_pa, access, 1 /* executable */, 1) < 0) {
uh_log('L', "rkp_paging.c", 392, "Failed to make Kernel range ROX");
return -1;
}
// If the kernel text and rodata sections are read-only in the second stage, return here.
if (access) {
return 0;
}
// Ensure the end of the kernel text and rodata sections are page-aligned.
if (((erodata_pa | ETEXT_PA) & 0xfff) != 0) {
uh_log('L', "rkp_paging.c", 158, "start or end addr is not aligned, %p - %p", ETEXT_PA, erodata_pa);
return 0;
}
// Ensure the end of the kernel text is before the end of the rodata section.
if (ETEXT_PA > erodata_pa) {
uh_log('L', "rkp_paging.c", 163, "start addr is bigger than end addr %p, %p", ETEXT_PA, erodata_pa);
return 0;
}
// Mark all the pages belonging to the kernel rodata as read-only in the `ro_bitmap`.
paddr = ETEXT_PA;
while (sparsemap_set_value_addr(&uh_state.ro_bitmap, paddr, 1) >= 0) {
paddr += 0x1000;
if (paddr >= erodata_pa) {
return 0;
}
}
// Setting the bit failed for the current page.
uh_log('L', "rkp_paging.c", 171, "set_pgt_bitmap fail, %p", paddr);
return 0;
}
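We mentioned that, after rkp_paging_init, rkp_start also calls rkp_l1pgt_process_table to process the page tables. We will detail the inner workings of this function later, but it is called with the value of the TTBR0_EL1 register and mainly makes its 3 levels of tables read-only.
RKP Deferred Start¶
On the kernel side, the second startup-related command is called in rkp_deferred_init.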
▸ include/linux/rkp.h
static inline void rkp_deferred_init(void){
uh_call(UH_APP_RKP, RKP_DEFERRED_START, 0, 0, 0, 0);
}
rkp_deferred_init itself is called by kernel_init, which runs later in the kernel boot process.
▸ init/main.c
static int __ref kernel_init(void *unused)
{
// ...
rkp_deferred_init();
// ...
}
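On the hypervisor side, the command handler rkp_cmd_deferred_start simply calls rkp_deferred_start. It sets the kernel text section as read-only in the second stage. It also processes the two kernel page tables, swapper_pg_dir and tramp_pg_dir, using the rkp_l1pgt_process_table function.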
int64_t rkp_deferred_start() {
uh_log('L', "rkp_init.c", 193, "DEFERRED INIT START");
// Set the kernel text section as read-only in the second stage (here the argument is 0x80).
if (rkp_set_kernel_rox(0x80 /* read-only */)) {
return -1;
}
// Call a function that will process the `swapper_pg_dir` kernel page tables.
if (rkp_l1pgt_process_table(INIT_MM_PGD, 0x1ffffff, 1) < 0) {
uh_log('L', "rkp_init.c", 198, "Failed to make l1pgt processing");
return -1;
}
// Call a function that will process the `tramp_pg_dir` kernel page tables.
if (TRAMP_PGD && rkp_l1pgt_process_table(TRAMP_PGD, 0x1ffffff, 1) < 0) {
uh_log('L', "rkp_init.c", 204, "Failed to make l1pgt processing");
return -1;
}
// Mark RKP as deferred initialized.
rkp_deferred_inited = 1;
uh_log('L', "rkp_init.c", 217, "DEFERRED INIT IS DONE\n");
memory_fini();
return 0;
}
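RKP Bitmaps¶
By digging into the kernel sources, we can find 3 more commands that are called by the kernel during startup.
Two of them are also called in rkp_init: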
▸ init/main.c
static void __init rkp_init(void)
{
// ...
rkp_s_bitmap_ro = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_RO_BITMAP, 0, 0, 0, 0);
rkp_s_bitmap_dbl = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_DBL_BITMAP, 0, 0, 0, 0);
// ...
}
These two commands, RKP_GET_RO_BITMAP and RKP_GET_DBL_BITMAP, take an instance of sparse_bitmap_for_kernel as an argument.
▸ include/linux/rkp.h
typedef struct sparse_bitmap_for_kernel {
u64 start_addr;
u64 end_addr;
u64 maxn;
char **map;
} sparse_bitmap_for_kernel_t;
These instances are rkp_s_bitmap_ro and rkp_s_bitmap_dbl, respectively.
▸ init/main.c
sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
▸ init/main.c
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;
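They correspond, respectively, to the hypervisor's ro_bitmap and dbl_bitmap sparsemaps.
The first one is used to check whether a page has been set as read-only by the hypervisor, using the rkp_is_pg_protected function.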
▸ include/linux/rkp.h
static inline u8 rkp_is_pg_protected(u64 va){
return rkp_check_bitmap(__pa(va), rkp_s_bitmap_ro);
}
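The second one is used to check whether a page has already been mapped and should not be mapped a second time, using the rkp_is_pg_dbl_mapped function.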
▸ include/linux/rkp.h
static inline u8 rkp_is_pg_dbl_mapped(u64 pa){
return rkp_check_bitmap(pa, rkp_s_bitmap_dbl);
}
Both functions call rkp_check_bitmap, which extracts from the kernel bitmap the bit corresponding to the given physical address.
▸ include/linux/rkp.h
#define SPARSE_UNIT_BIT (30)
#define SPARSE_UNIT_SIZE (1<<SPARSE_UNIT_BIT)
// ...
static inline u8 rkp_check_bitmap(u64 pa, sparse_bitmap_for_kernel_t *kernel_bitmap){
u8 val;
u64 offset, map_loc, bit_offset;
char *map;
if(!kernel_bitmap || !kernel_bitmap->map)
return 0;
offset = pa - kernel_bitmap->start_addr;
map_loc = ((offset % SPARSE_UNIT_SIZE) / PAGE_SIZE) >> 3;
bit_offset = ((offset % SPARSE_UNIT_SIZE) / PAGE_SIZE) % 8;
if(kernel_bitmap->maxn <= (offset >> SPARSE_UNIT_BIT))
return 0;
map = kernel_bitmap->map[(offset >> SPARSE_UNIT_BIT)];
if(!map)
return 0;
val = (u8)((*(u64 *)(&map[map_loc])) >> bit_offset) & ((u64)1);
return val;
}
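The indexing done by rkp_check_bitmap can be reproduced in user space: the offset from start_addr selects a 1 GB chunk, and the page number within that chunk selects a byte and a bit. The sketch below is a simplified model, not the kernel code. The sketch_ names are hypothetical, a 4 KB page size is assumed, the setter exists only to exercise the lookup, and a single byte is read where the kernel reads a u64.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_UNIT_BIT 30
#define SKETCH_UNIT_SIZE (1ULL << SKETCH_UNIT_BIT)
#define SKETCH_PAGE_SIZE 4096ULL /* assumption: 4 KB pages */
/* One bit per page of a 1 GB chunk: 32768 bytes. */
#define SKETCH_CHUNK_BYTES (SKETCH_UNIT_SIZE / SKETCH_PAGE_SIZE / 8)

/* Same fields as sparse_bitmap_for_kernel_t, with fixed-width types. */
typedef struct {
  uint64_t start_addr;
  uint64_t end_addr;
  uint64_t maxn;
  uint8_t **map;
} sketch_bitmap_t;

/* Return the bit associated with the page containing `pa`, or 0 if it is not tracked. */
int sketch_check_bitmap(uint64_t pa, const sketch_bitmap_t *bm) {
  if (!bm || !bm->map) {
    return 0;
  }
  uint64_t offset = pa - bm->start_addr;
  uint64_t chunk = offset >> SKETCH_UNIT_BIT;
  uint64_t page = (offset % SKETCH_UNIT_SIZE) / SKETCH_PAGE_SIZE;
  if (chunk >= bm->maxn || !bm->map[chunk]) {
    return 0;
  }
  return (bm->map[chunk][page >> 3] >> (page % 8)) & 1;
}

/* Hypothetical helper used only to build test data: set the bit of a tracked page. */
void sketch_set_bit(uint64_t pa, sketch_bitmap_t *bm) {
  uint64_t offset = pa - bm->start_addr;
  uint64_t chunk = offset >> SKETCH_UNIT_BIT;
  uint64_t page = (offset % SKETCH_UNIT_SIZE) / SKETCH_PAGE_SIZE;
  if (chunk < bm->maxn && bm->map[chunk]) {
    bm->map[chunk][page >> 3] |= (uint8_t)(1u << (page % 8));
  }
}
```

For a bitmap starting at 0x80000000, the page at 0x80005000 is page 5 of chunk 0, which is bit 5 of byte 0.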
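RKP_GET_RO_BITMAP and RKP_GET_DBL_BITMAP are handled similarly by the hypervisor, so we will only look at the handler of the first one.
rkp_cmd_get_ro_bitmap allocates a sparse_bitmap_for_kernel_t structure from the dynamic heap, zeroes it out, and passes it to sparsemap_bitmap_kernel, which fills it with the information from ro_bitmap. It then puts the VA of the newly allocated structure into X0, and if a pointer was provided in X2, it also puts the VA there (using virt_to_phys_el1 for the conversion).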
int64_t rkp_cmd_get_ro_bitmap(saved_regs_t* regs) {
// ...
// This command cannot be called after RKP has been deferred initialized.
if (rkp_deferred_inited) {
return -1;
}
// Allocate the bitmap structure that will be returned to the kernel.
bitmap = dynamic_heap_alloc(0x20, 0);
if (!bitmap) {
uh_log('L', "rkp.c", 302, "Fail alloc robitmap for kernel");
return -1;
}
// Reset the newly allocated structure.
memset(bitmap, 0, sizeof(sparse_bitmap_for_kernel_t));
// Fill the kernel bitmap with the contents of the hypervisor `ro_bitmap`.
res = sparsemap_bitmap_kernel(&uh_state.ro_bitmap, bitmap);
if (res) {
uh_log('L', "rkp.c", 309, "Fail sparse_map_bitmap_kernel");
return res;
}
// Put the kernel bitmap VA in x0.
regs->x0 = rkp_get_va(bitmap);
// Put the kernel bitmap VA in the memory referenced by x2.
if (regs->x2) {
*virt_to_phys_el1(regs->x2) = regs->x0;
}
uh_log('L', "rkp.c", 322, "robitmap:%p", bitmap);
return 0;
}
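To see how the kernel bitmap is filled from the hypervisor sparsemap, let's take a look at sparsemap_bitmap_kernel. This function converts the PAs of all the sparsemap entries into VAs before copying them into the sparse_bitmap_for_kernel_t structure.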
int64_t sparsemap_bitmap_kernel(sparsemap_t* map, sparse_bitmap_for_kernel_t* kernel_bitmap) {
// ...
// Sanity-check the arguments.
if (!map || !kernel_bitmap) {
return -1;
}
// Copy the start address, end address, and entries unchanged.
kernel_bitmap->start_addr = map->start_addr;
kernel_bitmap->end_addr = map->end_addr;
kernel_bitmap->maxn = map->count;
// Allocate from the dynamic heap an array to hold the entries addresses.
bitmaps = dynamic_heap_alloc(8 * map->count, 0);
if (!bitmaps) {
uh_log('L', "sparsemap.c", 202, "kernel_bitmap does not allocated : %lu", map->count);
return -1;
}
// Private sparsemaps are not allowed to be accessed by the kernel.
if (map->private) {
uh_log('L', "sparsemap.c", 206, "EL1 doesn't support to get private sparsemap");
return -1;
}
// Zero out the allocated memory.
memset(bitmaps, 0, 8 * map->count);
// Save the VA of the allocated array.
kernel_bitmap->map = (bitmaps - PHYS_OFFSET) | 0xffffffc000000000;
index = 0;
do {
// Store the VAs of the entries into the array.
bitmap = map->entries[index].bitmap;
if (bitmap) {
bitmaps[index] = (bitmap - PHYS_OFFSET) | 0xffffffc000000000;
}
++index;
} while (index < kernel_bitmap->maxn);
return 0;
}
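The PA to VA conversion used above is a plain linear-map computation: subtract the physical base of memory, then set the upper bits of the kernel linear mapping. The sketch below reproduces it in isolation. The value chosen for PHYS_OFFSET (0x80000000) is an assumption made for illustration, and the inverse function is our own addition that only holds for offsets below 256 GB.

```c
#include <stdint.h>

#define SKETCH_PHYS_OFFSET 0x80000000ULL         /* assumption: physical base of DRAM */
#define SKETCH_LINEAR_BASE 0xffffffc000000000ULL /* constant used by the hypervisor */

/* Same computation as in sparsemap_bitmap_kernel. */
uint64_t sketch_pa_to_va(uint64_t pa) {
  return (pa - SKETCH_PHYS_OFFSET) | SKETCH_LINEAR_BASE;
}

/* Hypothetical inverse, valid as long as the offset fits in the low 38 bits. */
uint64_t sketch_va_to_pa(uint64_t va) {
  return (va & ~SKETCH_LINEAR_BASE) + SKETCH_PHYS_OFFSET;
}
```

Under that assumption, the physical address 0x87100000 would be seen by the kernel at 0xffffffc007100000.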
The third command is RKP_GET_RKP_GET_BUFFER_BITMAP, which is called by the kernel in rkp_robuffer_init.
▸ init/main.c
static void __init rkp_robuffer_init(void)
{
rkp_s_bitmap_buffer = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_RKP_GET_BUFFER_BITMAP, 0, 0, 0, 0);
}
It is also used to retrieve a sparsemap, this time page_allocator.map.
▸ init/main.c
sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;
It is used to check whether a page comes from the hypervisor's page allocator, using the is_rkp_ro_page function.
▸ include/linux/rkp.h
static inline unsigned int is_rkp_ro_page(u64 va){
return rkp_check_bitmap(__pa(va), rkp_s_bitmap_buffer);
}
The 3 commands used to retrieve the sparsemaps are all called from the start_kernel function.
▸ init/main.c
asmlinkage __visible void __init start_kernel(void)
{
// ...
rkp_robuffer_init();
// ...
rkp_init();
// ...
}
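To summarize, the kernel uses these bitmaps to check whether some data lies on a page protected by RKP. If that is the case, the kernel knows that it needs to call one of the RKP commands to modify it.
Page Tables Processing¶
We put aside the calls to rkp_l1pgt_process_table that we saw in rkp_start and rkp_deferred_start, but now it is time to detail how the hypervisor processes the kernel page tables. But first, a quick reminder about the layout of the kernel page tables.
Here is the Linux memory layout on Android (using 4 KB pages + 3 levels):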
Start End Size Use
-----------------------------------------------------------------------
0000000000000000 0000007fffffffff 512GB user
ffffff8000000000 ffffffffffffffff 512GB kernel
And here is the corresponding translation table lookup:
+--------+--------+--------+--------+--------+--------+--------+--------+
|63 56|55 48|47 40|39 32|31 24|23 16|15 8|7 0|
+--------+--------+--------+--------+--------+--------+--------+--------+
| | | | | |
| | | | | v
| | | | | [11:0] in-page offset
| | | | +-> [20:12] L3 index (PTE)
| | | +-----------> [29:21] L2 index (PMD)
| | +---------------------> [38:30] L1 index (PUD)
| +-------------------------------> [47:39] L0 index (PGD)
+-------------------------------------------------> [63] TTBR0/1
So keep in mind that, in this section, we have PGD = PUD = VA[38:30], because we are only using 3 levels of address translation.
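To make the lookup concrete, the following sketch splits a virtual address into the three table indices and the in-page offset, using the bit ranges from the diagram (4 KB pages, 3 levels). The structure and function names are hypothetical.

```c
#include <stdint.h>

/* Indices used by a 3-level walk with 4 KB pages. */
typedef struct {
  uint32_t l1;     /* VA[38:30], PGD/PUD index */
  uint32_t l2;     /* VA[29:21], PMD index */
  uint32_t l3;     /* VA[20:12], PTE index */
  uint32_t offset; /* VA[11:0], in-page offset */
} sketch_va_indices_t;

/* Split a virtual address into its table indices and in-page offset. */
sketch_va_indices_t sketch_split_va(uint64_t va) {
  sketch_va_indices_t idx;
  idx.l1 = (uint32_t)((va >> 30) & 0x1ff);
  idx.l2 = (uint32_t)((va >> 21) & 0x1ff);
  idx.l3 = (uint32_t)((va >> 12) & 0x1ff);
  idx.offset = (uint32_t)(va & 0xfff);
  return idx;
}
```

For example, the kernel address 0xffffff8008080000 gives L1 index 0, L2 index 64, L3 index 128, and an offset of 0. The upper bits only select between TTBR0 and TTBR1.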
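Here are the formats of the level 0, level 1, and level 2 descriptors (which can be invalid, block, or table descriptors):
And here are the formats of the level 3 descriptors (which can be invalid or page descriptors):
First Level¶
Processing of the first level tables (or PGDs) is done by the rkp_l1pgt_process_table function. A kernel PGD must be either swapper_pg_dir or tramp_pg_dir, unless we are prior to the deferred initialization. This function also never processes the user PGD idmap_pg_dir.
If the PGD is being introduced, it is marked as L1 in the physmap and made read-only in the second stage. If the PGD is being retired, it is marked as FREE in the physmap and made writable in the second stage.
Finally, the descriptors of the PGD are processed: table descriptors are passed to the rkp_l2pgt_process_table function, and have their PXN bit set if this is a user PGD, whereas block descriptors have their PXN bit set regardless of the PGD type.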
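The way descriptors are told apart can be sketched as follows. This mirrors the checks made by the decompiled function, which treats any non-zero value whose low two bits are not 0b11 as a block descriptor (architecturally, a descriptor with bit 0 clear is invalid). The enum and function names are hypothetical.

```c
#include <stdint.h>

typedef enum {
  SKETCH_DESC_INVALID,
  SKETCH_DESC_BLOCK,
  SKETCH_DESC_TABLE,
} sketch_desc_kind_t;

/* Classify a level 1 descriptor the same way the PGD walk does. */
sketch_desc_kind_t sketch_classify_l1_desc(uint64_t desc) {
  if ((desc & 0x3) == 0x3) {
    return SKETCH_DESC_TABLE;
  }
  if (desc != 0) {
    return SKETCH_DESC_BLOCK;
  }
  return SKETCH_DESC_INVALID;
}

/* Output address of a table descriptor: bits [47:12], i.e. the PA of the next-level table. */
uint64_t sketch_table_addr(uint64_t desc) {
  return desc & 0xfffffffff000ULL;
}
```

A descriptor such as 0x880a1003 is therefore a table descriptor pointing to a PMD at 0x880a1000, which is the address handed to the next processing level.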
int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
// ...
// If this is a kernel PGD.
if (high_bits == 0x1ffffff) {
// It should be either `swapper_pg_dir` or `tramp_pg_dir`, or RKP should not be deferred initialized.
if (pgd != INIT_MM_PGD && (!TRAMP_PGD || pgd != TRAMP_PGD) || rkp_deferred_inited) {
// If it is not, we trigger a policy violation that results in a panic.
rkp_policy_violation("only allowed on kerenl PGD or tramp PDG! l1t : %lx", pgd);
return -1;
}
} else {
// If it is a user PGD and it is `idmap_pg_dir`, return without procesing it.
if (ID_MAP_PGD == pgd) {
return 0;
}
}
rkp_phys_map_lock(pgd);
// If we are introducing this PGD.
if (is_alloc) {
// If it is already marked as a PGD in the physmap, return without processing it.
if (is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
// Compute the correct type (`KERNEL` or not).
if (high_bits) {
type = KERNEL | L1;
} else {
type = L1;
}
// And mark the PGD as such in the physmap.
res = rkp_phys_map_set(pgd, type);
if (res < 0) {
rkp_phys_map_unlock(pgd);
return res;
}
// Make the PGD read-only in the second stage.
res = rkp_s2_page_change_permission(pgd, 0x80 /* read-only */, 0 /* non-executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l1pgt.c", 63, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 1);
rkp_phys_map_unlock(pgd);
return res;
}
}
// If we are retiring this PGD.
else {
// If it is not marked as a PGD in the physmap, return without processing it.
if (!is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
// Mark the PGD as `FREE` in the physmap.
res = rkp_phys_map_set(pgd, FREE);
if (res < 0) {
rkp_phys_map_unlock(pgd);
return res;
}
// Make the PGD writable in the second stage.
res = rkp_s2_page_change_permission(pgd, 0 /* writable */, 1 /* executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l1pgt.c", 80, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 0);
rkp_phys_map_unlock(pgd);
return res;
}
}
// Now iterate over each descriptor of the PGD.
offset = 0;
entry = 0;
start_addr = high_bits << 39;
do {
desc_p = pgd + entry;
desc = *desc_p;
// Block descriptor (not a table, not invalid).
if ((desc & 0b11) != 0b11) {
if (desc) {
// Make the memory non executable at EL1.
set_pxn_bit_of_desc(desc_p, 1);
}
}
// Table descriptor.
else {
addr = start_addr & 0xffffff803fffffff | offset;
// Call rkp_l2pgt_process_table to process the PMD.
res += rkp_l2pgt_process_table(desc & 0xfffffffff000, addr, is_alloc);
// Make the memory non executable at EL1 for user PGDs.
if (!(start_addr >> 39)) {
set_pxn_bit_of_desc(desc_p, 1);
}
}
entry += 8;
offset += 0x40000000;
start_addr = addr;
} while (entry != 0x1000);
rkp_phys_map_unlock(pgd);
return res;
}
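The descriptor loop above walks 0x1000 bytes of 8-byte entries, that is 512 entries, and advances the mapped address by 0x40000000 (1 GB) each time. The sketch below isolates that address computation. It is a simplified model with hypothetical names, not the hypervisor's code, and it widens high_bits to 64 bits before shifting, which plain C requires.

```c
#include <stdint.h>

#define DEMO_L1_ENTRY_SPAN 0x40000000ULL /* each PGD entry maps 1 GB */
#define DEMO_L1_ENTRIES 512              /* 0x1000 bytes / 8 bytes per entry */

/* First virtual address mapped by entry `index` of a PGD, where `high_bits` is VA[63:39]. */
static uint64_t demo_l1_entry_va(uint32_t high_bits, unsigned index) {
  /* Widen before shifting: `high_bits` is 32-bit, the result is 64-bit. */
  uint64_t base = (uint64_t)high_bits << 39;
  return base | ((uint64_t)index * DEMO_L1_ENTRY_SPAN);
}
```

With high_bits set to 0x1ffffff, entry 0 starts at 0xffffff8000000000, the bottom of the kernel range shown in the memory layout.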
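Second Level¶
Processing of the second level tables (or PMDs) is done by the rkp_l2pgt_process_table function. If the first user PMD given to this function was not allocated from the hypervisor page allocator, then user PMDs will no longer be processed.
If the PMD is being introduced, it is marked as L2 in the physmap and made read-only in the second stage. Kernel PMDs are never allowed to be retired. If a user PMD is being retired, it is marked as FREE in the physmap and made writable in the second stage.
Finally, the descriptors of the PMD are processed: all descriptors are passed to the check_single_l2e function.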
int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
  // ...
  // If this is a user PMD.
  if (!(start_addr >> 39)) {
    // The first time this function is called, determine if the PMD was allocated by the hypervisor page allocator. The
    // default value of `pmd_allocated_by_rkp` is 0, 1 means "process the PMD", -1 means "don't process it".
    if (!pmd_allocated_by_rkp) {
      if (page_allocator_is_allocated(pmd) == 1) {
        pmd_allocated_by_rkp = 1;
      } else {
        pmd_allocated_by_rkp = -1;
      }
    }
    // If the PMD was not allocated by RKP, return without processing it.
    if (pmd_allocated_by_rkp == -1) {
      return 0;
    }
  }
  rkp_phys_map_lock(pmd);
  // If we are introducing this PMD.
  if (is_alloc) {
    // If it is already marked as a PMD in the physmap, return without processing it.
    if (is_phys_map_l2(pmd)) {
      rkp_phys_map_unlock(pmd);
      return 0;
    }
    // Compute the correct type (`KERNEL` or not).
    if (start_addr >> 39) {
      type = KERNEL | L2;
    } else {
      type = L2;
    }
    // And mark the PMD as such in the physmap.
    res = rkp_phys_map_set(pmd, (start_addr >> 23) & 0xff80 | type);
    if (res < 0) {
      rkp_phys_map_unlock(pmd);
      return res;
    }
    // Make the PMD read-only in the second stage.
    res = rkp_s2_page_change_permission(pmd, 0x80 /* read-only */, 0 /* non-executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l2pgt.c", 98, "Process l2t failed, %lx, %d", pmd, 1);
      rkp_phys_map_unlock(pmd);
      return res;
    }
  }
  // If we are retiring this PMD.
  else {
    // If it is not marked as a PMD in the physmap, return without processing it.
    if (!is_phys_map_l2(pmd)) {
      rkp_phys_map_unlock(pmd);
      return 0;
    }
    // Kernel PMDs are not allowed to be retired.
    if (start_addr >= 0xffffff8000000000) {
      rkp_policy_violation("Never allow free kernel page table %lx", pmd);
    }
    // Also check that it is not marked `KERNEL` in the physmap.
    if (is_phys_map_kernel(pmd)) {
      rkp_policy_violation("Entry must not point to kernel page table %lx", pmd);
    }
    // Mark the PMD as `FREE` in the physmap.
    res = rkp_phys_map_set(pmd, FREE);
    if (res < 0) {
      rkp_phys_map_unlock(pmd);
      return 0;
    }
    // Make the PMD writable in the second stage.
    res = rkp_s2_page_change_permission(pmd, 0 /* writable */, 1 /* executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l2pgt.c", 123, "Process l2t failed, %lx, %d", pmd, 0);
      rkp_phys_map_unlock(pmd);
      return 0;
    }
  }
  // Now iterate over each descriptor of the PMD.
  offset = 0;
  for (i = 0; i != 0x1000; i += 8) {
    addr = offset | start_addr & 0xffffffffc01fffff;
    // Call `check_single_l2e` on each descriptor.
    res += check_single_l2e(pmd + i, addr, is_alloc);
    offset += 0x200000;
  }
  rkp_phys_map_unlock(pmd);
  return res;
}
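The handling of user PMDs hinges on a decision that is taken once and then reused for every later call. Here is a minimal sketch of that tri-state latch. The function pointer stands in for the hypervisor page allocator lookup, and every name in the block is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* 0 = not decided yet, 1 = process user PMDs, -1 = never process them. */
static int demo_user_pmd_policy = 0;

/* Hypothetical stand-in for the hypervisor page allocator lookup. */
typedef bool (*demo_is_allocated_fn)(uint64_t pmd);

/* Example stubs for the lookup, used to exercise both outcomes. */
static bool demo_always_allocated(uint64_t pmd) { (void)pmd; return true; }
static bool demo_never_allocated(uint64_t pmd) { (void)pmd; return false; }

/* Returns true if a user PMD should be processed. The decision is taken on the first call only. */
static bool demo_should_process_user_pmd(uint64_t pmd, demo_is_allocated_fn is_allocated) {
  if (demo_user_pmd_policy == 0) {
    demo_user_pmd_policy = is_allocated(pmd) ? 1 : -1;
  }
  return demo_user_pmd_policy == 1;
}
```

Once the first user PMD has been judged, later calls return the same answer whatever the allocator would say about them, which matches the behavior described above.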
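check_single_l2e processes each PMD descriptor. If the descriptor maps a VA that is executable, the PMD is not allowed to be retired. If it is being introduced, the hypervisor will protect the next level table. If the VA is not executable, the PXN bit of the descriptor is set.
If the descriptor is a block descriptor, no further processing is done. However, if it is a table descriptor, the rkp_l3pgt_process_table function is called to process the next level table.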
int64_t check_single_l2e(int64_t* desc_p, uint64_t start_addr, int32_t is_alloc) {
  // ...
  // If the virtual address mapped by this descriptor is executable (it is in the `executable_regions` memlist).
  if (executable_regions_contains(start_addr, 2)) {
    // The PMD is not allowed to be retired, trigger a policy violation.
    if (!is_alloc) {
      uh_log('L', "rkp_l2pgt.c", 36, "RKP_61acb13b %lx, %lx", desc_p, *desc_p);
      uh_log('L', "rkp_l2pgt.c", 37, "RKP_4083e222 %lx, %d, %d", start_addr, (start_addr >> 30) & 0x1ff,
             (start_addr >> 21) & 0x1ff);
      rkp_policy_violation("RKP_d60f7274");
    }
    // The PMD is being allocated, set the protect flag (to protect the next level table).
    protect = 1;
  } else {
    // The virtual address is not executable, set the PXN bit of the descriptor.
    set_pxn_bit_of_desc(desc_p, 2);
    // Unset the protect flag (we don't need to protect the next level table).
    protect = 0;
  }
  // Get the descriptor type.
  desc = *desc_p;
  type = desc & 0b11;
  // Block descriptor, return without processing it.
  if (type == 0b01) {
    return 0;
  }
  // Invalid descriptor, return without processing it.
  if (type != 0b11) {
    if (desc) {
      uh_log('L', "rkp_l2pgt.c", 64, "Invalid l2e %p %p %p", desc, is_alloc, desc_p);
    }
    return 0;
  }
  // Table descriptor, log if the PT needs to be protected.
  if (protect) {
    uh_log('L', "rkp_l2pgt.c", 56, "L3 table to be protected, %lx, %d, %d", desc, (start_addr >> 21) & 0x1ff,
           (start_addr >> 30) & 0x1ff);
  }
  // If the kernel PMD is being retired, log as well.
  if (!is_alloc && start_addr >= 0xffffff8000000000) {
    uh_log('L', "rkp_l2pgt.c", 58, "l2 table FREE-1 %lx, %d, %d", *desc_p, (start_addr >> 30) & 0x1ff,
           (start_addr >> 21) & 0x1ff);
    uh_log('L', "rkp_l2pgt.c", 59, "l2 table FREE-2 %lx, %d, %d", desc_p, 0x1ffffff, 0);
  }
  // Call rkp_l3pgt_process_table to process the PT.
  return rkp_l3pgt_process_table(*desc_p & 0xfffffffff000, start_addr, is_alloc, protect);
}
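The decisions taken by check_single_l2e depend on three inputs only: whether the mapped VA is executable, whether the table is being introduced or retired, and the descriptor type. The sketch below restates them as a pure function that reports the outcome instead of acting on it. The struct and function names are hypothetical, and logging is left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  bool violation;  /* retiring an entry that maps executable memory */
  bool set_pxn;    /* the entry maps non-executable memory */
  bool process_l3; /* table descriptor: descend into the PT */
  bool protect_l3; /* the PT must be protected by the hypervisor */
} demo_l2e_outcome_t;

static demo_l2e_outcome_t demo_check_l2e(uint64_t desc, bool va_is_executable, bool is_alloc) {
  demo_l2e_outcome_t out = {false, false, false, false};
  if (va_is_executable) {
    /* Executable mappings may be introduced but never retired. */
    out.violation = !is_alloc;
  } else {
    out.set_pxn = true;
  }
  /* Only table descriptors (0b11) lead to a third level table. */
  if ((desc & 0x3) == 0x3) {
    out.process_l3 = true;
    out.protect_l3 = va_is_executable;
  }
  return out;
}
```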
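Third Level¶
Processing of the third level tables (or PTs) is done by the rkp_l3pgt_process_table function. If the PT maps the kernel text, the PTE of the kernel text start is saved into the stext_ptep global variable. If the PT doesn't need to be protected, the function returns without any processing.
If the PT is being introduced, it is marked as L3 in the physmap and made read-only in the second stage. The descriptors of the PT are processed: invalid descriptors trigger a violation, and descriptors mapping a non-executable VA have their PXN bit set.
If the PT is being retired, it is marked as FREE in the physmap, and a violation is triggered. If the violation doesn't panic (though it should after initialization, since rkp_panic_on_violation is set), the PT is made writable in the second stage. The descriptors of the PT are processed: invalid descriptors trigger a violation, and descriptors mapping an executable VA trigger a violation.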
int64_t rkp_l3pgt_process_table(int64_t pte, uint64_t start_addr, uint32_t is_alloc, int32_t protect) {
  // ...
  cs_enter(&l3pgt_lock);
  // If `stext_ptep` hasn't been set already, and this PT maps the kernel text (i.e. the first virtual address mapped
  // and the kernel text have the same PGD, PUD, PMD indexes), then set `stext_ptep` to the PTE of the kernel text
  // start.
  if (!stext_ptep && ((TEXT ^ start_addr) & 0x7fffe00000) == 0) {
    stext_ptep = pte + 8 * ((TEXT >> 12) & 0x1ff);
    uh_log('L', "rkp_l3pgt.c", 74, "set stext ptep %lx", stext_ptep);
  }
  cs_exit(&l3pgt_lock);
  // If we don't need to protect this PT, return without processing it.
  if (!protect) {
    return 0;
  }
  rkp_phys_map_lock(pte);
  // If we are introducing this PT.
  if (is_alloc) {
    // If it is already marked as a PT in the physmap, return without processing it.
    if (is_phys_map_l3(pte)) {
      uh_log('L', "rkp_l3pgt.c", 87, "Process l3t SKIP %lx, %d, %d", pte, 1, start_addr >> 39);
      rkp_phys_map_unlock(pte);
      return 0;
    }
    // Compute the correct type (`KERNEL` or not).
    if (start_addr >> 39) {
      type = KERNEL | L3;
    } else {
      type = L3;
    }
    // And mark the PT as such in the physmap.
    res = rkp_phys_map_set(pte, type);
    if (res < 0) {
      rkp_phys_map_unlock(pte);
      return res;
    }
    // Make the PT read-only in the second stage.
    res = rkp_s2_page_change_permission(pte, 0x80 /* read-only */, 0 /* non-executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l3pgt.c", 102, "Process l3t failed %lx, %d", pte, 1);
      rkp_phys_map_unlock(pte);
      return res;
    }
    // Now iterate over each descriptor of the PT.
    offset = 0;
    desc_p = pte;
    do {
      addr = offset | start_addr & 0xffffffffffe00fff;
      if (addr >> 39) {
        desc = *desc_p;
        if (desc) {
          // Invalid descriptor, trigger a violation.
          if ((desc & 0b11) != 0b11) {
            rkp_policy_violation("Invalid l3e, %lx, %lx, %d", desc, desc_p, 1);
          }
          // Page descriptor, if the virtual address mapped by this descriptor is not executable, then set the PXN bit.
          if (!executable_regions_contains(addr, 3)) {
            set_pxn_bit_of_desc(desc_p, 3);
          }
        }
      } else {
        uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1ff,
               (addr >> 21) & 0x1ff);
      }
      offset += 0x1000;
      ++desc_p;
    } while (offset != 0x200000);
  }
  // If we are retiring this PT.
  else {
    // If it is not marked as a PT in the physmap, return without processing it.
    if (!is_phys_map_l3(pte)) {
      uh_log('L', "rkp_l3pgt.c", 110, "Process l3t SKIP, %lx, %d, %d", pte, 0, start_addr >> 39);
      rkp_phys_map_unlock(pte);
      return 0;
    }
    // Mark the PT as `FREE` in the physmap.
    res = rkp_phys_map_set(pte, FREE);
    if (res < 0) {
      rkp_phys_map_unlock(pte);
      return res;
    }
    // Protected PTs are not allowed to be retired, so trigger a violation. If we did not panic, continue.
    rkp_policy_violation("Free l3t not allowed, %lx, %d, %d", pte, 0, start_addr >> 39);
    // Make the PT writable in the second stage.
    res = rkp_s2_page_change_permission(pte, 0 /* writable */, 1 /* executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l3pgt.c", 127, "Process l3t failed, %lx, %d", pte, 0);
      rkp_phys_map_unlock(pte);
      return res;
    }
    // Now iterate over each descriptor of the PT.
    offset = 0;
    desc_p = pte;
    do {
      addr = offset | start_addr & 0xffffffffffe00fff;
      if (addr >> 39) {
        desc = *desc_p;
        if (desc) {
          // Invalid descriptor, trigger a violation.
          if ((desc & 0b11) != 0b11) {
            rkp_policy_violation("Invalid l3e, %lx, %lx, %d", desc, desc_p, 0);
          }
          // Page descriptor, if the virtual address mapped by this descriptor is executable, trigger a violation.
          if (executable_regions_contains(addr, 3)) {
            rkp_policy_violation("RKP_b5438cb1");
          }
        }
      } else {
        uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1ff,
               (addr >> 21) & 0x1ff);
      }
      offset += 0x1000;
      ++desc_p;
    } while (offset != 0x200000);
  }
  rkp_phys_map_unlock(pte);
  return 0;
}
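The stext_ptep computation at the top of the function packs two steps into two expressions: a mask comparison on VA bits [38:21] to decide whether this PT covers the kernel text, then an index into the PT using VA bits [20:12]. The sketch below separates them. The function names are hypothetical, and the text address is passed as a parameter rather than read from a global.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `text` and `start_addr` share the same L1 and L2 indexes (VA bits [38:21]). */
static bool demo_pt_maps_text(uint64_t text, uint64_t start_addr) {
  return ((text ^ start_addr) & 0x7fffe00000ULL) == 0;
}

/* Address of the 8-byte PTE describing `text` inside the PT located at `pt`. */
static uint64_t demo_text_ptep(uint64_t pt, uint64_t text) {
  return pt + 8 * ((text >> 12) & 0x1ff);
}
```

For instance, with an illustrative text address of 0xffffff8008080000, only the PT whose first mapped address is 0xffffff8008000000 passes the first check, and the PTE sits at offset 0x400 (entry 128) in that table.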
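If the functions processing the kernel page tables find something they consider a policy violation, they call rkp_policy_violation with a string describing the violation as the first argument. This function logs the message and calls uh_panic if rkp_panic_on_violation is set.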
int64_t rkp_policy_violation(const char* message, ...) {
  // ...
  // Log the violation message and its arguments.
  res = rkp_log(0x4c, "rkp.c", 108, message, /* variable arguments */);
  // Panic if panic on violation is enabled.
  if (rkp_panic_on_violation) {
    uh_panic();
  }
  return res;
}
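rkp_log is a wrapper around uh_log that adds the current time and CPU number to the message. It also calls bigdata_store_rkp_string to copy the formatted message to the analytics, or bigdata, region.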
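The "log, then panic if the flag is set" pattern can be modeled in user space with a variadic function. This is only a sketch under assumptions: the demo_ names are hypothetical, the message is stored in a buffer instead of the hypervisor log, and a counter stands in for uh_panic, which would not return.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

static bool demo_panic_on_violation = false; /* plays the role of rkp_panic_on_violation */
static int demo_panic_count = 0;             /* stand-in for uh_panic */
static char demo_last_message[256];          /* stand-in for the log region */

/* Formats the violation message, records it, and "panics" if the flag is set. */
static int demo_policy_violation(const char *message, ...) {
  va_list args;
  va_start(args, message);
  int res = vsnprintf(demo_last_message, sizeof(demo_last_message), message, args);
  va_end(args);
  if (demo_panic_on_violation) {
    ++demo_panic_count;
  }
  return res;
}
```

While the flag is clear, a violation is only recorded and the caller carries on, which is why rkp_l3pgt_process_table above still has code after its "Free l3t not allowed" violation.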
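Overall State After Startup¶
This section serves as a reference for the overall state once startup (both normal and deferred) has finished. We go over each of RKP's internal structures, as well as the hypervisor-controlled page tables, and detail what they contain and where entries are added or removed.
Memlist dynamic_regions¶
Memlist protected_ranges¶
Memlist page_allocator.list¶
Memlist executable_regions¶
Memlist dynamic_load_regions¶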
Sparsemap physmap (based on dynamic_regions)¶
initialized in init_cmd_initialize_dynamic_heap
TEXT-ETEXT set as TEXT in rkp_paging_init
PGD (TTBR0_EL1) set as L1 in rkp_l1pgt_process_table
PMDs (TTBR0_EL1) set as L2 in rkp_l2pgt_process_table
PTEs (TTBR0_EL1) set as L3 where the VA is in executable_regions, in rkp_l3pgt_process_table
PGD (swapper|tramp_pg_dir) set as KERNEL|L1 in rkp_l1pgt_process_table
PMDs (swapper|tramp_pg_dir) set as KERNEL|L2 in rkp_l2pgt_process_table
PTEs (swapper|tramp_pg_dir) set as KERNEL|L3 where the VA is in executable_regions, in rkp_l3pgt_process_table
(value changed in rkp_lxpgt_process_table)
(value changed in set_range_to_pxn|rox_l3)
(value changed in rkp_set_pages_ro, rkp_ro_free_pages)
Sparsemap ro_bitmap (based on dynamic_regions)¶
initialized in init_cmd_initialize_dynamic_heap
ETEXT-ERODATA set as 1 in rkp_set_kernel_rox
(value changed in rkp_s2_page_change_permission)
(value changed in rkp_s2_range_change_permission)
Sparsemap dbl_bitmap (based on dynamic_regions)¶
Sparsemap robuf/page_allocator.map (based on dynamic_regions)¶
initialized in init_cmd_initialize_dynamic_heap
(value changed in page_allocator_alloc_page)
(value changed in page_allocator_free_page)
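Page Tables of EL2 Stage 1¶
uH region mapped from the initial page tables
0x87100000-0x87140000 (log region) mapped RW in memory_init
0x870FF000-0x87100000 (bigdata region) mapped RW in uh_init_bigdata
end of the regions added by S-Boot up to 0xA00000000 unmapped in init_cmd_initialize_dynamic_heap
TEXT-ETEXT mapped RO in rkp_paging_init
swapper_pg_dir page mapped RW in rkp_paging_init
(this list doesn't include changes made after startup)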
Page Tables of EL1 Stage 2¶
dynamic heap region mapped RW in init_cmd_initialize_dynamic_heap
end of the regions added by S-Boot up to 0xA00000000 unmapped in init_cmd_initialize_dynamic_heap
0x87000000-0x87200000 (uH region) unmapped in rkp_paging_init
empty_zero_page page mapped RWX in rkp_paging_init
TEXT-ERODATA mapped RWX in rkp_set_kernel_rox (from rkp_paging_init)
0x87100000-0x87140000 (log region) mapped ROX in rkp_paging_init
dynamic heap region mapped ROX in rkp_paging_init
PGD (TTBR0_EL1) mapped RO in rkp_l1pgt_process_table
PMDs (TTBR0_EL1) mapped RO in rkp_l2pgt_process_table
PTEs (TTBR0_EL1) mapped RO where the VA is in executable_regions, in rkp_l3pgt_process_table
TEXT-ERODATA mapped ROX in rkp_set_kernel_rox (from rkp_deferred_start)
PGD (swapper|tramp_pg_dir) mapped RO in rkp_l1pgt_process_table
PMDs (swapper|tramp_pg_dir) mapped RO in rkp_l2pgt_process_table
PTEs (swapper|tramp_pg_dir) mapped RO where the VA is in executable_regions, in rkp_l3pgt_process_table
(reminder: this list doesn't include changes made after startup)
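RKP and KDP Commands¶
In the previous sections, we saw how RKP takes full control of the kernel page tables and what it does when processing them. We will now see how this control is used to protect critical kernel data, mainly by allocating it on read-only pages and requiring an HVC to modify it.
Protecting Kernel Data¶
Global Variables¶
All the global variables that need to be protected by RKP are annotated with either __rkp_ro or __kdp_ro in the kernel sources. These macros move the global variables to the .rkp_ro and .kdp_ro sections, respectively.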
▸ include/linux/linkage.h
#ifdef CONFIG_UH_RKP
#define __page_aligned_rkp_bss __section(.rkp_bss.page_aligned) __aligned(PAGE_SIZE)
#define __rkp_ro __section(.rkp_ro)
// ...
#endif
▸ include/linux/linkage.h
#ifdef CONFIG_RKP_KDP
#define __kdp_ro __section(.kdp_ro)
#define __lsm_ro_after_init_kdp __section(.kdp_ro)
// ...
#endif
These sections are part of the kernel's .rodata section, which is made read-only in the second stage by rkp_set_kernel_rox.
▸ include/asm-generic/vmlinux.lds.h
#define RO_DATA_SECTION(align)
// ...
.rkp_ro : AT(ADDR(.rkp_ro) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start_rkp_ro) = .; \
*(.rkp_ro) \
VMLINUX_SYMBOL(__stop_rkp_ro) = .; \
VMLINUX_SYMBOL(__start_kdp_ro) = .; \
*(.kdp_ro) \
VMLINUX_SYMBOL(__stop_kdp_ro) = .; \
VMLINUX_SYMBOL(__start_rkp_ro_pgt) = .; \
RKP_RO_PGT \
VMLINUX_SYMBOL(__stop_rkp_ro_pgt) = .; \
} \
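The same idea, tagging variables with a section attribute and bracketing the section with start and stop symbols, can be tried in user space. The sketch below assumes an ELF target built with GCC or Clang, where the linker defines __start_ and __stop_ symbols automatically for sections whose names are valid C identifiers. The section name demo_ro and all demo_ identifiers are hypothetical, and nothing here makes the memory read-only; it only shows the grouping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical section name, chosen as a valid C identifier so the linker emits __start_/__stop_ symbols. */
#define DEMO_RO __attribute__((section("demo_ro"), used))

int demo_enforcing DEMO_RO = 1;
int demo_enabled DEMO_RO = 1;
int demo_ordinary = 0; /* regular variable, outside the section */

/* Defined by the linker on ELF targets (assumption: GNU ld or LLD). */
extern char __start_demo_ro[];
extern char __stop_demo_ro[];

/* Returns true if `p` points inside the demo_ro section. */
static bool demo_in_protected_section(const void *p) {
  uintptr_t a = (uintptr_t)p;
  return a >= (uintptr_t)__start_demo_ro && a < (uintptr_t)__stop_demo_ro;
}
```

Grouping the variables into one contiguous range is what lets a single permission change cover all of them, instead of tracking each variable separately.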
Below is the list of all the global variables that are protected this way.
▸ arch/arm64/mm/mmu.c
unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_rkp_bss;
▸ arch/arm64/mm/mmu.c
static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_rkp_bss;
static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_rkp_bss __maybe_unused;
static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_rkp_bss __maybe_unused;
▸ fs/namespace.c
struct super_block *sys_sb __kdp_ro = NULL;
struct super_block *odm_sb __kdp_ro = NULL;
struct super_block *vendor_sb __kdp_ro = NULL;
struct super_block *art_sb __kdp_ro = NULL;
struct super_block *rootfs_sb __kdp_ro = NULL;
▸ init/main.c
int is_recovery __kdp_ro = 0;
▸ init/main.c
rkp_init_t rkp_init_data __rkp_ro = { /* ... */ };
▸ init/main.c
sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;
▸ init/main.c
int __check_verifiedboot __kdp_ro = 0;
▸ kernel/cred.c
int rkp_cred_enable __kdp_ro = 0;
▸ kernel/cred.c
struct cred init_cred __kdp_ro = { /* ... */ };
▸ security/selinux/hooks.c
struct task_security_struct init_sec __kdp_ro;
▸ security/selinux/hooks.c
int selinux_enforcing __kdp_ro;
▸ security/selinux/hooks.c
int selinux_enabled __kdp_ro = 1;
▸ security/selinux/hooks.c
static struct security_hook_list selinux_hooks[] __lsm_ro_after_init_kdp = { /* ... */ };
▸ security/selinux/ss/services.c
int ss_initialized __kdp_ro;
Continued in the next part.