RISC-Y Business: Raging against the reduced machine

Abstract

In recent years the interest in obfuscation has increased, mainly because people want to protect their intellectual property. Unfortunately, most of what’s been written is focused on the theoretical aspects. In this article, we will discuss the practical engineering challenges of developing a low-footprint virtual machine interpreter. The VM is easily embeddable, built on open-source technology and has various hardening features that were achieved with minimal effort.

Introduction

In addition to protecting intellectual property, a minimal virtual machine can be useful for other reasons. You might want to have an embeddable interpreter to execute business logic (shellcode), without having to deal with RWX memory. It can also be useful as an educational tool, or just for fun.

Creating a custom VM architecture (similar to VMProtect/Themida) means that we would have to deal with binary rewriting/lifting or write our own compiler. Instead, we decided to use a preexisting architecture that LLVM already supports: RISC-V. This architecture is widely used for educational purposes and has the advantage of being very simple to understand and implement.

Initially, the main contender was WebAssembly. However, existing interpreters were very bloated and would also require dealing with a binary format. Additionally, it looks like WASM64 is very underdeveloped and our memory model requires 64-bit pointer support. SPARC and PowerPC were also considered, but RISC-V seems to be more popular and there are a lot more resources available for it.

WebAssembly was designed for sandboxing and therefore strictly separates guest and host memory. Because we will be writing our own RISC-V interpreter, we chose to instead share memory between the guest and the host. This means that pointers in the RISC-V execution context (the guest) are valid in the host process and vice-versa.

As a result, the instructions responsible for reading/writing memory can be implemented as a simple memcpy call and we do not need additional code to translate/validate memory accesses (which helps with our goal of small code size). With this property, we need to implement only two system calls to perform arbitrary operations in the host process:

uintptr_t riscvm_get_peb();
uintptr_t riscvm_host_call(uintptr_t rip, uintptr_t args[13]);

The riscvm_get_peb call is Windows-specific: it returns a pointer to the Process Environment Block (PEB), which allows us to resolve exports that we can then pass to the riscvm_host_call function to execute arbitrary code. Additionally, an optional host_syscall stub could be implemented, but this is not strictly necessary since we can just call the functions in ntdll.dll instead.

To keep the interpreter footprint as low as possible, we decided to develop a toolchain that outputs a freestanding binary. The goal is to copy this binary into memory and point the VM’s program counter there to start execution. Because we are in freestanding mode, there is no C runtime available to us, so we have to handle initialization ourselves.

As an example, we will use the following hello.c file:

int _start() {
    int result = 0;
    for(int i = 0; i < 52; i++) {
        result += *(volatile int*)&i;
    }
    return result + 11;
}

We compile the program with the following incantation:

clang -target riscv64 -march=rv64g -mcmodel=medany -Os -c hello.c -o hello.o

And then verify by disassembling the object:

$ llvm-objdump --disassemble hello.o

hello.o:        file format elf64-littleriscv

0000000000000000 <_start>:
       0: 13 01 01 ff   addi    sp, sp, -16
       4: 13 05 00 00   li      a0, 0
       8: 23 26 01 00   sw      zero, 12(sp)
       c: 93 05 30 03   li      a1, 51

0000000000000010 <.LBB0_1>:
      10: 03 26 c1 00   lw      a2, 12(sp)
      14: 33 05 a6 00   add     a0, a2, a0
      18: 9b 06 16 00   addiw   a3, a2, 1
      1c: 23 26 d1 00   sw      a3, 12(sp)
      20: 63 40 b6 00   blt     a2, a1, 0x20 <.LBB0_1+0x10>
      24: 1b 05 b5 00   addiw   a0, a0, 11
      28: 13 01 01 01   addi    sp, sp, 16
      2c: 67 80 00 00   ret

The hello.o is a regular ELF object file. To get a freestanding binary we need to invoke the linker with a linker script:

ENTRY(_start)

LINK_BASE = 0x8000000;

SECTIONS
{
    . = LINK_BASE;
    __base = .;

    .text : ALIGN(16) {
        . = LINK_BASE;
        *(.text)
        *(.text.*)
    }

    .data : {
        *(.rodata)
        *(.rodata.*)
        *(.data)
        *(.data.*)
        *(.eh_frame)
    }

    .init : {
        __init_array_start = .;
        *(.init_array)
        __init_array_end = .;
    }

    .bss : {
        *(.bss)
        *(.bss.*)
        *(.sbss)
        *(.sbss.*)
    }

    .relocs : {
        . = . + SIZEOF(.bss);
        __relocs_start = .;
    }
}

This script is the result of an excessive amount of swearing and experimentation. The format is .name : { ... } where .name is the destination (output) section and the patterns inside the braces select the input sections to paste in there. The special . symbol is the location counter, which refers to the current position in the binary. We also define a few special symbols for use by the runtime:

Symbol	Meaning
`__base`	Base of the executable.
`__init_array_start`	Start of the C++ init arrays.
`__init_array_end`	End of the C++ init arrays.
`__relocs_start`	Start of the relocations (end of the binary).

These symbols are declared as extern in the C code and they will be resolved at link-time. While it may seem confusing at first that we have a destination section, it starts to make sense once you realize the linker has to output a regular ELF executable. That ELF executable is then passed to llvm-objcopy to create the freestanding binary blob. This makes debugging a whole lot easier (because we get DWARF symbols) and since we will not implement an ELF loader, it also allows us to extract the relocations for embedding into the final binary.

To link the intermediate ELF executable and then create the freestanding hello.pre.bin:

ld.lld.exe -o hello.elf --oformat=elf -emit-relocs -T ..\lib\linker.ld --Map=hello.map hello.o
llvm-objcopy -O binary hello.elf hello.pre.bin

For debugging purposes we also output hello.map, which tells us exactly where the linker put the code/data:

             VMA              LMA     Size Align Out     In      Symbol
               0                0        0     1 LINK_BASE = 0x8000000
               0                0  8000000     1 . = LINK_BASE
         8000000                0        0     1 __base = .
         8000000          8000000       30    16 .text
         8000000          8000000        0     1         . = LINK_BASE
         8000000          8000000       30     4         hello.o:(.text)
         8000000          8000000       30     1                 _start
         8000010          8000010        0     1                 .LBB0_1
         8000030          8000030        0     1 .init
         8000030          8000030        0     1         __init_array_start = .
         8000030          8000030        0     1         __init_array_end = .
         8000030          8000030        0     1 .relocs
         8000030          8000030        0     1         . = . + SIZEOF ( .bss )
         8000030          8000030        0     1         __relocs_start = .
               0                0       18     8 .rela.text
               0                0       18     8         hello.o:(.rela.text)
               0                0       3b     1 .comment
               0                0       3b     1         <internal>:(.comment)
               0                0       30     1 .riscv.attributes
               0                0       30     1         <internal>:(.riscv.attributes)
               0                0      108     8 .symtab
               0                0      108     8         <internal>:(.symtab)
               0                0       55     1 .shstrtab
               0                0       55     1         <internal>:(.shstrtab)
               0                0       5c     1 .strtab
               0                0       5c     1         <internal>:(.strtab)

The final ingredient of the toolchain is a small Python script (relocs.py) that extracts the relocations from the ELF file and appends them to the end of the hello.pre.bin. The custom relocation format only supports R_RISCV_64 and is resolved by our CRT like so:

typedef struct
{
    uint8_t  type;
    uint32_t offset;
    int64_t  addend;
} __attribute__((packed)) Relocation;

extern uint8_t __base[];
extern uint8_t __relocs_start[];

#define LINK_BASE    0x8000000
#define R_RISCV_NONE 0
#define R_RISCV_64   2

static __attribute__((noinline)) void riscvm_relocs()
{
    if (*(uint32_t*)__relocs_start != 'ALER')
    {
        asm volatile("ebreak");
    }

    uintptr_t load_base = (uintptr_t)__base;

    for (Relocation* itr = (Relocation*)(__relocs_start + sizeof(uint32_t)); itr->type != R_RISCV_NONE; itr++)
    {
        if (itr->type == R_RISCV_64)
        {
            uint64_t* ptr = (uint64_t*)((uintptr_t)itr->offset - LINK_BASE + load_base);
            *ptr -= LINK_BASE;
            *ptr += load_base;
        }
        else
        {
            asm volatile("ebreak");
        }
    }
}
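
The relocs.py script itself is not reproduced here. As a rough sketch of the producer side, the table that riscvm_relocs walks could be serialized as follows. The helper name pack_relocs is hypothetical, the zeroed terminator entry is an assumption (the loop above only inspects its type byte), and the magic value assumes that Clang evaluates the multi-character constant 'ALER' to 0x414C4552, which lands in little-endian memory as the bytes RELA. Extracting the relocations from hello.elf is left out.

```python
import struct

R_RISCV_NONE = 0
R_RISCV_64 = 2

# Assumed value of the multi-character constant 'ALER' under Clang/GCC.
MAGIC = 0x414C4552


def pack_relocs(relocs):
    """Serialize (type, offset, addend) tuples into the packed table layout.

    Each entry mirrors the packed C struct: uint8 type, uint32 offset,
    int64 addend, little-endian, without padding (13 bytes).
    """
    blob = struct.pack("<I", MAGIC)
    for rtype, offset, addend in relocs:
        if rtype != R_RISCV_64:
            raise ValueError(f"unsupported relocation type {rtype}")
        blob += struct.pack("<BIq", rtype, offset, addend)
    # Terminator: an entry whose type is R_RISCV_NONE stops the CRT loop.
    blob += struct.pack("<BIq", R_RISCV_NONE, 0, 0)
    return blob


# Example: one pointer slot at link address 0x8000100 (made-up value).
table = pack_relocs([(R_RISCV_64, 0x8000100, 0)])
```

The resulting blob would be appended to hello.pre.bin so that it starts exactly at __relocs_start.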

As you can see, the __base and __relocs_start magic symbols are used here. The only reason this works is the -mcmodel=medany we used when compiling the object. You can find more details in this article and in the RISC-V ELF Specification. In short, this flag allows the compiler to assume that all code will be emitted in a 2 GiB address range, which allows more liberal PC-relative addressing. The R_RISCV_64 relocation type gets emitted when you put pointers in the .data section:

void* functions[] = {
    &function1,
    &function2,
};

This also happens when using vtables in C++, and we wanted to support these properly early on, instead of having to fight with horrifying bugs later.
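
To make the arithmetic in riscvm_relocs concrete, here is a small Python model of a single R_RISCV_64 fix-up. The function name and the example addresses are made up for illustration; only LINK_BASE comes from the linker script.

```python
LINK_BASE = 0x8000000
MASK64 = 0xFFFFFFFFFFFFFFFF


def relocate_r_riscv_64(image, slot_offset, load_base):
    """Rebase one 64-bit pointer slot from LINK_BASE to load_base.

    image is a bytearray holding the loaded binary and slot_offset is the
    position of the pointer relative to the start of that image.
    """
    old = int.from_bytes(image[slot_offset:slot_offset + 8], "little")
    new = (old - LINK_BASE + load_base) & MASK64
    image[slot_offset:slot_offset + 8] = new.to_bytes(8, "little")
    return new


# A pointer slot at image offset 0x40 that refers to link address 0x8000010.
image = bytearray(0x50)
image[0x40:0x48] = (LINK_BASE + 0x10).to_bytes(8, "little")

# Pretend the binary was copied to this (hypothetical) host address.
patched = relocate_r_riscv_64(image, 0x40, 0x7FF600000000)
```

After the fix-up the slot points at the same function, 0x10 bytes into the image, but relative to the address where the binary actually lives.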

The next piece of the CRT involves the handling of the init arrays (which get emitted by global instances of classes that have a constructor):

typedef void (*InitFunction)();
extern InitFunction __init_array_start;
extern InitFunction __init_array_end;

static __attribute__((optnone)) void riscvm_init_arrays()
{
    for (InitFunction* itr = &__init_array_start; itr != &__init_array_end; itr++)
    {
        (*itr)();
    }
}

Frustratingly, we were not able to get this function to generate correct code without the __attribute__((optnone)). We suspect this has to do with aliasing assumptions (the start/end can technically refer to the same memory), but we didn’t investigate this further.

Interpreter internals

Note: the interpreter was initially based on riscvm.c by edubart. However, we have since completely rewritten it in C++ to better suit our purpose.

Based on the RISC-V Calling Conventions document, we can create an enum for the 32 registers:

enum RegIndex
{
    reg_zero, // always zero (immutable)
    reg_ra,   // return address
    reg_sp,   // stack pointer
    reg_gp,   // global pointer
    reg_tp,   // thread pointer
    reg_t0,   // temporary
    reg_t1,
    reg_t2,
    reg_s0,   // callee-saved
    reg_s1,
    reg_a0,   // arguments
    reg_a1,
    reg_a2,
    reg_a3,
    reg_a4,
    reg_a5,
    reg_a6,
    reg_a7,
    reg_s2,   // callee-saved
    reg_s3,
    reg_s4,
    reg_s5,
    reg_s6,
    reg_s7,
    reg_s8,
    reg_s9,
    reg_s10,
    reg_s11,
    reg_t3,   // temporary
    reg_t4,
    reg_t5,
    reg_t6,
};

We just need to add a pc register and we have the structure to represent the RISC-V CPU state:

struct riscvm
{
    int64_t  pc;
    uint64_t regs[32];
};

It is important to keep in mind that the zero register is always set to 0 and we have to prevent writes to it by using a macro:

#define reg_write(idx, value)            \
    do                                   \
    {                                    \
        if (LIKELY((idx) != reg_zero))   \
        {                                \
            self->regs[(idx)] = (value); \
        }                                \
    } while (0)

The instructions (ignoring the optional compression extension) are always 32 bits in length and can be cleanly expressed as a union:

union Instruction
{
    struct
    {
        uint32_t compressed_flags : 2;
        uint32_t opcode           : 5;
        uint32_t                  : 25;
    };

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t rs2    : 5;
        uint32_t funct7 : 7;
    } rtype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t rs2    : 5;
        uint32_t shamt  : 1;
        uint32_t imm    : 6;
    } rwtype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t imm    : 12;
    } itype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t imm    : 20;
    } utype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t imm12  : 8;
        uint32_t imm11  : 1;
        uint32_t imm1   : 10;
        uint32_t imm20  : 1;
    } ujtype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t imm5   : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t rs2    : 5;
        uint32_t imm7   : 7;
    } stype;

    struct
    {
        uint32_t opcode   : 7;
        uint32_t imm_11   : 1;
        uint32_t imm_1_4  : 4;
        uint32_t funct3   : 3;
        uint32_t rs1      : 5;
        uint32_t rs2      : 5;
        uint32_t imm_5_10 : 6;
        uint32_t imm_12   : 1;
    } sbtype;

    int16_t  chunks16[2];
    uint32_t bits;
};
static_assert(sizeof(Instruction) == sizeof(uint32_t), "");
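
As a sanity check of the layouts above, the first instruction from the earlier disassembly (13 01 01 ff, addi sp, sp, -16) can be decoded by hand. The sketch below does this in Python rather than with the C++ union; the helper names are hypothetical and sign_extend merely stands in for a sign-extension helper.

```python
def bits(word, lo, width):
    """Extract `width` bits of `word`, starting at bit `lo`."""
    return (word >> lo) & ((1 << width) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of `value` as a signed integer."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_itype(word):
    """Split a 32-bit word according to the itype layout."""
    return {
        "compressed_flags": bits(word, 0, 2),
        "opcode": bits(word, 0, 7),
        "rd": bits(word, 7, 5),
        "funct3": bits(word, 12, 3),
        "rs1": bits(word, 15, 5),
        "imm": sign_extend(bits(word, 20, 12), 12),
    }


# Bytes exactly as printed by llvm-objdump, stored little-endian.
word = int.from_bytes(bytes.fromhex("130101ff"), "little")
fields = decode_itype(word)
```

Register index 2 is sp in the enum above, so the decoded fields line up with the disassembly: rd and rs1 are both sp and the immediate is -16.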

There are 13 top-level opcodes (Instruction.opcode) and some of those opcodes have another field that further specializes the functionality (e.g. Instruction.itype.funct3). To keep the code readable, the enumerations for the opcode are defined in opcodes.h. The interpreter is structured to have handler functions for the top-level opcode in the following form:

bool handler_rv64_<opcode>(riscvm_ptr self, Instruction inst);

As an example, we can look at the handler for the lui instruction (note that the handlers themselves are responsible for updating pc):

ALWAYS_INLINE static bool handler_rv64_lui(riscvm_ptr self, Instruction inst)
{
    int64_t imm = bit_signer(inst.utype.imm, 20) << 12;
    reg_write(inst.utype.rd, imm);

    self->pc += 4;
    dispatch(); // return true;
}

The interpreter executes until one of the handlers returns false, indicating the CPU has to halt:

void riscvm_run(riscvm_ptr self)
{
    while (true)
    {
        Instruction inst;
        inst.bits = *(uint32_t*)self->pc;
        if (!riscvm_execute_handler(self, inst))
            break;
    }
}
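
The control flow of this loop is easy to model outside of C++. The following Python sketch is a toy, not the actual interpreter: it knows a single handler (lui), uses the unshuffled standard encoding, and the VM class and table are hypothetical stand-ins. It does show the three rules described above: handlers advance pc themselves, unknown opcodes land in an invalid handler, and execution stops as soon as a handler returns False.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    memory: bytes
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)


def sign_extend(value, width):
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def handler_invalid(vm, word):
    return False  # unknown opcode: tell the run loop to halt


def handler_lui(vm, word):
    rd = (word >> 7) & 0x1F
    imm = sign_extend(word >> 12, 20) << 12
    if rd != 0:  # x0 stays hardwired to zero
        vm.regs[rd] = imm & 0xFFFFFFFFFFFFFFFF
    vm.pc += 4  # handlers are responsible for updating pc
    return True


OP_LUI = 0b01101  # bits 6..2 of the standard LUI opcode (0110111)

TABLE = [handler_invalid] * 32
TABLE[OP_LUI] = handler_lui


def run(vm):
    while vm.pc + 4 <= len(vm.memory):
        word = int.from_bytes(vm.memory[vm.pc:vm.pc + 4], "little")
        if not TABLE[(word >> 2) & 0x1F](vm, word):
            break


# lui a0, 0x12345 followed by an all-zero word that hits handler_invalid.
program = (0x12345537).to_bytes(4, "little") + bytes(4)
vm = VM(program)
run(vm)
```

After the run, a0 (register 10) holds the immediate shifted left by 12 and pc points at the word that caused the halt.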

Plenty of articles have been written about the semantics of RISC-V, so you can look at the source code if you’re interested in the implementation details of individual instructions. The structure of the interpreter also allows us to easily implement obfuscation features, which we will discuss in the next section.

For now, we will declare the handler functions as __attribute__((always_inline)) and set the -fno-jump-tables compiler option, which gives us a riscvm_run function that (comfortably) fits into a single page (0xCA4 bytes):

[Figure: interpreter control flow graph]

Hardening features

A regular RISC-V interpreter is fun, but an attacker can easily reverse engineer our payload by throwing it into Ghidra to decompile it. To force the attacker to at least look at our VM interpreter, we implemented a few security features. These features are implemented in a Python script that parses the linker MAP file and directly modifies the opcodes: encrypt.py.
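
The encrypt.py script is not listed in this post. As an illustration of the map-parsing step only, the sketch below locates an output section in text shaped like the hello.map listing shown earlier. The function names are hypothetical, and the assumption that only the .text range needs rewriting is ours.

```python
LINK_BASE = 0x8000000

SAMPLE_MAP = """\
             VMA              LMA     Size Align Out     In      Symbol
         8000000          8000000       30    16 .text
         8000000          8000000       30     4         hello.o:(.text)
         8000000          8000000       30     1                 _start
         8000030          8000030        0     1 .init
"""


def find_section(map_text, name):
    """Return (vma, size) of the row whose last column equals `name`."""
    for line in map_text.splitlines():
        parts = line.split()
        # Data rows have five columns: VMA, LMA, Size, Align, name.
        if len(parts) == 5 and parts[4] == name:
            vma, _lma, size = (int(p, 16) for p in parts[:3])
            return vma, size
    raise KeyError(name)


def text_range(map_text):
    """Translate the .text VMA into an offset inside the flat binary."""
    vma, size = find_section(map_text, ".text")
    return vma - LINK_BASE, size


offset, size = text_range(SAMPLE_MAP)
```

With the sample above, the code occupies the first 0x30 bytes of the flat binary, which is the range a post-processing script would have to rewrite.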

Opcode shuffling

The most elegant (and likely most effective) obfuscation is to simply reorder the enums of the instruction opcodes and sub-functions. The shuffle.py script is used to generate shuffled_opcodes.h, which is then included into riscvm.h instead of opcodes.h to mix the opcodes:

#ifdef OPCODE_SHUFFLING
#warning Opcode shuffling enabled
#include "shuffled_opcodes.h"
#else
#include "opcodes.h"
#endif // OPCODE_SHUFFLING

The script also generates a shuffled_opcodes.json file, which encrypt.py parses to know how to remap the opcodes of the assembled instructions.

Because enums are used for all the opcodes, we only need to recompile the interpreter to obfuscate it; there is no additional complexity cost in the implementation.
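
The real shuffle.py and the layout of shuffled_opcodes.json are not shown, so the following is only a sketch of the idea under simplifying assumptions: it permutes the 5-bit top-level opcode field (bits 2 to 6) and ignores the sub-function enums. All names are hypothetical.

```python
import random


def make_opcode_map(seed):
    """Build a random bijection over the 32 possible 5-bit opcode values."""
    rng = random.Random(seed)
    shuffled = list(range(32))
    rng.shuffle(shuffled)
    return dict(zip(range(32), shuffled))


def shuffle_opcode(word, mapping):
    """Replace the opcode field of a 32-bit instruction word."""
    opcode = (word >> 2) & 0x1F
    cleared = word & ~(0x1F << 2) & 0xFFFFFFFF
    return cleared | (mapping[opcode] << 2)


mapping = make_opcode_map(seed=1234)
inverse = {new: old for old, new in mapping.items()}

original = 0xFF010113  # addi sp, sp, -16 in the standard encoding
rewritten = shuffle_opcode(original, mapping)
restored = shuffle_opcode(rewritten, inverse)
```

Because the mapping is a bijection, a payload rewritten with it can only be decoded by an interpreter compiled against the matching header; every other bit of the instruction stays untouched.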

Bytecode encryption

To increase diversity between payloads for the same VM instance, we also employ a simple ‘encryption’ scheme on top of the opcode shuffling:

ALWAYS_INLINE static uint32_t tetra_twist(uint32_t input)
{
    /**
     * Custom hash function that is used to generate the encryption key.
     * This has strong avalanche properties and is used to ensure that
     * small changes in the input result in large changes in the output.
     */

    constexpr uint32_t prime1 = 0x9E3779B1; // a large prime number

    input ^= input >> 15;
    input *= prime1;
    input ^= input >> 12;
    input *= prime1;
    input ^= input >> 4;
    input *= prime1;
    input ^= input >> 16;

    return input;
}

ALWAYS_INLINE static uint32_t transform(uintptr_t offset, uint32_t key)
{
    uint32_t key2 = key + offset;
    return tetra_twist(key2);
}

ALWAYS_INLINE static uint32_t riscvm_fetch(riscvm_ptr self)
{
    uint32_t data;
    memcpy(&data, (const void*)self->pc, sizeof(data));

#ifdef CODE_ENCRYPTION
    return data ^ transform(self->pc - self->base, self->key);
#else
    return data;
#endif // CODE_ENCRYPTION
}

The offset relative to the start of the bytecode is used as the seed to a simple transform function. The result of this function is XOR’d with the instruction data before decoding. The exact transformation doesn’t really matter, because an attacker can always observe the decrypted bytecode at runtime. However, static analysis becomes more difficult and pattern-matching the payload is prevented, all for a relatively small increase in VM implementation complexity.
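
For completeness, here is what the encrypting side could look like in Python. This is a sketch that mirrors the C++ functions above with explicit 32-bit masking; the name crypt_code and the decision to process only whole 32-bit words of the code region are assumptions, not a description of encrypt.py.

```python
MASK32 = 0xFFFFFFFF
PRIME1 = 0x9E3779B1


def tetra_twist(value):
    """Python port of the C++ tetra_twist, wrapping at 32 bits."""
    value &= MASK32
    value ^= value >> 15
    value = (value * PRIME1) & MASK32
    value ^= value >> 12
    value = (value * PRIME1) & MASK32
    value ^= value >> 4
    value = (value * PRIME1) & MASK32
    value ^= value >> 16
    return value


def transform(offset, key):
    return tetra_twist((key + offset) & MASK32)


def crypt_code(code, key, start=0):
    """XOR every 32-bit word with the keystream for its offset.

    `start` is the offset of `code` relative to the bytecode base.
    XOR is its own inverse, so the same call encrypts and decrypts.
    """
    out = bytearray(code)
    for off in range(0, len(code) - len(code) % 4, 4):
        word = int.from_bytes(code[off:off + 4], "little")
        word ^= transform(start + off, key)
        out[off:off + 4] = word.to_bytes(4, "little")
    return bytes(out)


# Two identical instructions encrypt to different ciphertext words.
plain = bytes.fromhex("130101ff") * 2
encrypted = crypt_code(plain, key=0x1234)
decrypted = crypt_code(encrypted, key=0x1234)
```

One property worth knowing: tetra_twist(0) is 0, so an instruction whose key plus offset wraps to zero would be stored unencrypted.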

It would be possible to encrypt the contents of the .data section of the payload as well, but we would have to completely decrypt it in memory before starting execution anyway. Technically, it would also be possible to implement a lazy encryption scheme by customizing the riscvm_read and riscvm_write functions to intercept reads/writes to the payload region, but this idea was not pursued further.

Threaded handlers

The most interesting feature of our VM is that we only need to make minor code modifications to turn it into a so-called threaded interpreter. Threaded code is a well-known technique used both to speed up emulators and to introduce indirect branches that complicate reverse engineering. It is called threading because the execution can be visualized as a thread of handlers that directly branch to the next handler. There is no classical dispatch function with an infinite loop and a switch case for each opcode inside. Performance improves because every handler ends in its own indirect branch instead of sharing a single dispatch branch, which gives the branch predictor more context and leads to fewer mispredictions. You can find more information about threaded interpreters in the Dispatch Techniques section of the YETI paper.

The first step is to construct a handler table, where each handler is placed at the index corresponding to each opcode. To do this we use a small snippet of constexpr C++ code:

typedef bool (*riscvm_handler_t)(riscvm_ptr, Instruction);

static constexpr std::array<riscvm_handler_t, 32> riscvm_handlers = []
{
    // Pre-populate the table with invalid handlers
    std::array<riscvm_handler_t, 32> result = {};
    for (size_t i = 0; i < result.size(); i++)
    {
        result[i] = handler_rv64_invalid;
    }

    // Insert the opcode handlers at the right index
#define INSERT(op) result[op] = HANDLER(op)
    INSERT(rv64_load);
    INSERT(rv64_fence);
    INSERT(rv64_imm64);
    INSERT(rv64_auipc);
    INSERT(rv64_imm32);
    INSERT(rv64_store);
    INSERT(rv64_op64);
    INSERT(rv64_lui);
    INSERT(rv64_op32);
    INSERT(rv64_branch);
    INSERT(rv64_jalr);
    INSERT(rv64_jal);
    INSERT(rv64_system);
#undef INSERT
    return result;
}();

With the riscvm_handlers table populated we can define the dispatch macro:

#define dispatch()                                       \
    Instruction next;                                    \
    next.bits = riscvm_fetch(self);                      \
    if (next.compressed_flags != 0b11)                   \
    {                                                    \
        panic("compressed instructions not supported!"); \
    }                                                    \
    __attribute__((musttail)) return riscvm_handlers[next.opcode](self, next)

The musttail attribute forces the call to the next handler to be a tail call. This is only possible because all the handlers have the same function signature and it generates an indirect branch to the next handler:

[Figure: threaded handler disassembly]

The final piece of the puzzle is the new implementation of the riscvm_run function, which uses an empty riscvm_execute handler to bootstrap the chain of execution:

ALWAYS_INLINE static bool riscvm_execute(riscvm_ptr self, Instruction inst)
{
    dispatch();
}

NEVER_INLINE void riscvm_run(riscvm_ptr self)
{
    Instruction inst;
    riscvm_execute(self, inst);
}

Traditional obfuscation

The built-in hardening features that we can get with a few #ifdefs and a small Python script are good enough for a proof-of-concept, but they are not going to deter a determined attacker for a very long time. An attacker can pattern-match the VM’s handlers to simplify future reverse engineering efforts. To address this, we can employ common obfuscation techniques using LLVM obfuscation passes:

Instruction substitution (to make pattern matching more difficult)
Opaque predicates (to hinder static analysis)
Inject anti-debug checks (to make dynamic analysis more difficult)

The paper Modern obfuscation techniques by Roman Oravec gives a nice overview of the literature and has good data on which obfuscation passes are most effective considering their runtime overhead.

Additionally, it would also be possible to further enhance the VM’s security by duplicating handlers, but this would require extra post-processing on the payload itself. The VM itself is only part of what could be obfuscated. Obfuscating the payloads themselves is also something we can do quite easily. Most likely, manually-integrated security features (stack strings with xorstr, lazy_importer and variable encryption) will be most valuable here. However, because we use LLVM to build the payloads we can also employ automated obfuscation there. It is important to keep in mind that any overhead created in the payloads themselves is multiplied by the overhead created by the handler obfuscation, so experimentation is required to find the sweet spot for your use case.

Writing the payloads

The VM described in this post so far technically has the ability to execute arbitrary code. That being said, it would be rather annoying for an end-user to write said code. For example, we would have to manually resolve all imports and then use the riscvm_host_call function to actually execute them. The riscvm_host_call and riscvm_get_peb wrappers execute in the RISC-V context and their implementation looks like this:

uintptr_t riscvm_host_call(uintptr_t address, uintptr_t args[13])
{
    register uintptr_t a0 asm("a0") = address;
    register uintptr_t a1 asm("a1") = (uintptr_t)args;
    register uintptr_t a7 asm("a7") = 20000;
    // The "memory" clobber makes sure args[] is written out before the host reads it
    asm volatile("scall" : "+r"(a0) : "r"(a1), "r"(a7) : "memory");
    return a0;
}

uintptr_t riscvm_get_peb()
{
    register uintptr_t a0 asm("a0") = 0;
    register uintptr_t a7 asm("a7") = 20001;
    asm volatile("scall" : "+r"(a0) : "r"(a7) : "memory");
    return a0;
}

We can get a pointer to the PEB using riscvm_get_peb and then resolve a module by its x65599 hash:

// Structure definitions omitted for clarity
uintptr_t riscvm_resolve_dll(uint32_t module_hash)
{
    static PEB* peb = 0;
    if (!peb)
    {
        peb = (PEB*)riscvm_get_peb();
    }
    LIST_ENTRY* begin = &peb->Ldr->InLoadOrderModuleList;
    for (LIST_ENTRY* itr = begin->Flink; itr != begin; itr = itr->Flink)
    {
        LDR_DATA_TABLE_ENTRY* entry = CONTAINING_RECORD(itr, LDR_DATA_TABLE_ENTRY, InLoadOrderLinks);
        if (entry->BaseNameHashValue == module_hash)
        {
            return (uintptr_t)entry->DllBase;
        }
    }
    return 0;
}

Once we’ve obtained the base of the module we’re interested in, we can resolve the import by walking the export table:

uintptr_t riscvm_resolve_import(uintptr_t image, uint32_t export_hash)
{
    IMAGE_DOS_HEADER*       dos_header      = (IMAGE_DOS_HEADER*)image;
    IMAGE_NT_HEADERS*       nt_headers      = (IMAGE_NT_HEADERS*)(image + dos_header->e_lfanew);
    uint32_t                export_dir_size = nt_headers->OptionalHeader.DataDirectory[0].Size;
    IMAGE_EXPORT_DIRECTORY* export_dir =
        (IMAGE_EXPORT_DIRECTORY*)(image + nt_headers->OptionalHeader.DataDirectory[0].VirtualAddress);
    uint32_t* names = (uint32_t*)(image + export_dir->AddressOfNames);
    uint32_t* funcs = (uint32_t*)(image + export_dir->AddressOfFunctions);
    uint16_t* ords  = (uint16_t*)(image + export_dir->AddressOfNameOrdinals);

    for (uint32_t i = 0; i < export_dir->NumberOfNames; ++i)
    {
        char*     name = (char*)(image + names[i]);
        uintptr_t func = (uintptr_t)(image + funcs[ords[i]]);
        // Ignore forwarded exports
        if (func >= (uintptr_t)export_dir && func < (uintptr_t)export_dir + export_dir_size)
            continue;
        uint32_t hash = hash_x65599(name, true);
        if (hash == export_hash)
        {
            return func;
        }
    }

    return 0;
}
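
The hash_x65599 helper used above is not listed in this post. A plausible Python model is shown below, under two assumptions: the hash is the classic multiply-by-65599 string hash truncated to 32 bits, and the boolean selects case sensitivity (with the case-insensitive variant upper-casing each character, which is what a comparison against BaseNameHashValue would require).

```python
def hash_x65599(text, case_sensitive):
    """Multiply-by-65599 string hash, truncated to 32 bits.

    Assumption: when case_sensitive is False, every character is
    upper-cased before it is mixed into the hash.
    """
    value = 0
    for ch in text:
        if not case_sensitive:
            ch = ch.upper()
        value = (value * 65599 + ord(ch)) & 0xFFFFFFFF
    return value


# Module names are matched regardless of case, export names exactly.
module_hash = hash_x65599("kernel32.dll", False)
export_hash = hash_x65599("LoadLibraryA", True)
```

In the payload itself these hashes would ideally be computed at compile time, so that neither the module names nor the export names appear as strings in the binary.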

Now we can call MessageBoxA from RISC-V with the following code:

// NOTE: We cannot use Windows.h here
#include <stdint.h>

int main()
{
    // Resolve LoadLibraryA
    auto kernel32_dll = riscvm_resolve_dll(hash_x65599("kernel32.dll", false));
    auto LoadLibraryA = riscvm_resolve_import(kernel32_dll, hash_x65599("LoadLibraryA", true));

    // Load user32.dll
    uint64_t args[13];
    args[0] = (uint64_t)"user32.dll";
    auto user32_dll = riscvm_host_call(LoadLibraryA, args);

    // Resolve MessageBoxA
    auto MessageBoxA = riscvm_resolve_import(user32_dll, hash_x65599("MessageBoxA", true));

    // Show a message to the user
    args[0] = 0; // hwnd
    args[1] = (uint64_t)"Hello from RISC-V!"; // msg
    args[2] = (uint64_t)"riscvm"; // title
    args[3] = 0; // flags
    riscvm_host_call(MessageBoxA, args);
}

With some templates/macros/constexpr tricks we can probably get this down to something more readable, but fundamentally this code will always stay annoying to write. Even if calling imports were a one-liner, we would still have to deal with the fact that we cannot use Windows.h (or any of the Microsoft headers for that matter). The reason for this is that we are cross-compiling with Clang. Even if we were to set up the include paths correctly, it would still be a major pain to get everything to compile correctly. That being said, our VM works! A major advantage of RISC-V is that, since the instruction set is simple, once the fundamentals work, we can be confident that features built on top of this will execute as expected.

Whole Program LLVM

Discussions of LLVM tooling usually assume that the compilation process runs on Linux/macOS. In this section, we will describe a pipeline that can actually be used on Windows, without making modifications to your toolchain. This is useful if you would like to analyze/fuzz/obfuscate Windows applications, which might only compile with an MSVC-compatible compiler such as clang-cl.

Link-time optimization (LTO)

Without LTO, the object files produced by Clang are native COFF/ELF/Mach-O files. Every file is optimized and compiled independently. The linker loads these objects and merges them together into the final executable.

When enabling LTO, the object files are instead LLVM Bitcode (.bc) files. This allows the linker to merge all the LLVM IR together and perform (more comprehensive) whole-program optimizations. After the LLVM IR has been optimized, the native code is generated and the final executable is produced. The diagram below comes from the great Link-time optimisation (LTO) post by Ryan Stinnett:

[Figure: LTO workflow]

Compiler wrappers

Unfortunately, it can be quite annoying to write an executable that can replace the compiler. It is quite simple when dealing with a few object files, but with bigger projects it gets quite tricky (especially when CMake is involved). Existing projects such as WLLVM and gllvm take this approach, but they do not work nicely on Windows. When using CMake, you can use the CMAKE_<LANG>_COMPILER_LAUNCHER variables and intercept the compilation pipeline that way, but that is also tricky to deal with.

On Windows, things are more complex than on Linux. This is because Clang uses a different program to link the final executable and correctly intercepting this process can become quite challenging.

Embedding bitcode

To achieve our goal of post-processing the bitcode of the whole program, we need to enable bitcode embedding. The first flag we need is -flto, which enables LTO. The second flag is -lto-embed-bitcode, which isn’t documented very well. When using clang-cl, you also need a special incantation to enable it:

set(EMBED_TYPE "post-merge-pre-opt") # post-merge-pre-opt/optimized
if(NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    if(WIN32)
        message(FATAL_ERROR "clang-cl is required, use -T ClangCL --fresh")
    else()
        message(FATAL_ERROR "clang compiler is required")
    endif()
elseif(CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES "^MSVC$")
    # clang-cl
    add_compile_options(-flto)
    add_link_options(/mllvm:-lto-embed-bitcode=${EMBED_TYPE})
elseif(WIN32)
    # clang (Windows)
    add_compile_options(-fuse-ld=lld-link -flto)
    add_link_options(-Wl,/mllvm:-lto-embed-bitcode=${EMBED_TYPE})
else()
    # clang (Linux)
    add_compile_options(-fuse-ld=lld -flto)
    add_link_options(-Wl,-lto-embed-bitcode=${EMBED_TYPE})
endif()

The -lto-embed-bitcode flag creates an additional .llvmbc section in the final executable that contains the bitcode. It offers three settings:

-lto-embed-bitcode=<value> - Embed LLVM bitcode in object files produced by LTO
    =none                  - Do not embed 
    =optimized             - Embed after all optimization passes
    =post-merge-pre-opt    - Embed post merge, but before optimizations

Once the bitcode is embedded within the output binary, it can be extracted using llvm-objcopy and disassembled with llvm-dis. This is normally done as follows:

llvm-objcopy --dump-section=.llvmbc=program.bc program
llvm-dis program.bc > program.ll

Unfortunately, we discovered a bug/oversight in LLD on Windows. The section is extracted without errors, but llvm-dis fails to load the bitcode. The reason for this is that Windows executables have a FileAlignment attribute, leading to additional padding with zeroes. To get valid bitcode, you need to remove some of these trailing zeroes:

import argparse
import sys
import pefile

def main():
    # Parse the arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("executable", help="Executable with embedded .llvmbc section")
    parser.add_argument("--output", "-o", help="Output file name", required=True)
    args = parser.parse_args()
    executable: str = args.executable
    output: str = args.output

    # Find the .llvmbc section
    pe = pefile.PE(executable)
    llvmbc = None
    for section in pe.sections:
        if section.Name.decode("utf-8").strip("\x00") == ".llvmbc":
            llvmbc = section
            break
    if llvmbc is None:
        print("No .llvmbc section found")
        sys.exit(1)

    # Recover the bitcode and write it to a file
    with open(output, "wb") as f:
        data = bytearray(llvmbc.get_data())
        # Truncate all trailing null bytes
        while data and data[-1] == 0:
            data.pop()
        # Recover alignment to 4
        while len(data) % 4 != 0:
            data.append(0)
        # Add a block end marker
        for _ in range(4):
            data.append(0)
        f.write(data)

if __name__ == "__main__":
    main()

In our testing this heuristic worked without issues, but there might be cases where it does not produce valid bitcode. In that case, a potential solution could be to brute force the number of trailing zeroes until the bitcode parses without errors.
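The brute-force fallback could be sketched as follows. This is a minimal sketch, not part of the original tooling: the names brute_force_padding and llvm_dis_accepts are hypothetical, the limit of 64 zeroes is an arbitrary assumption, and the validator assumes llvm-dis is available on the PATH. Only lengths that are a multiple of 4 are tried, matching the alignment step in the script above.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional


def llvm_dis_accepts(candidate: bytes) -> bool:
    """Hypothetical validator: True if llvm-dis parses the candidate.

    Assumes llvm-dis is on the PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        bc_path = Path(tmp) / "candidate.bc"
        ll_path = Path(tmp) / "candidate.ll"
        bc_path.write_bytes(candidate)
        result = subprocess.run(
            ["llvm-dis", str(bc_path), "-o", str(ll_path)],
            capture_output=True,
        )
        return result.returncode == 0


def brute_force_padding(
    raw: bytes,
    is_valid: Callable[[bytes], bool],
    max_zeroes: int = 64,  # arbitrary upper bound (assumption)
) -> Optional[bytes]:
    """Try increasing amounts of trailing zeroes until is_valid accepts."""
    stripped = raw.rstrip(b"\x00")
    for count in range(max_zeroes + 1):
        candidate = stripped + b"\x00" * count
        # Same alignment requirement as the heuristic above
        if len(candidate) % 4 != 0:
            continue
        if is_valid(candidate):
            return candidate
    return None
```

With the section contents in a variable named data, calling brute_force_padding(bytes(data), llvm_dis_accepts) would return the first candidate that llvm-dis accepts, or None if no candidate parses.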

Applications

Now that we have access to our program’s bitcode, several applications become feasible:

Write an analyzer to identify potentially interesting locations within the program.
Instrument the bitcode and then re-link the executable, which is particularly useful for code coverage while fuzzing.
Obfuscate the bitcode before re-linking the executable, enhancing security.
IR retargeting, where the bitcode compiled for one architecture can be used on another.

Relinking the executable

The bitcode itself unfortunately does not contain enough information to re-link the executable (although this is something we would like to implement upstream). We could either manually attempt to reconstruct the linker command line (with tools like Process Monitor), or use LLVM plugin support. Plugin support is not really functional on Windows (although there is some indication that Sony is using it for their PS4/PS5 toolchain), but we can still load an arbitrary DLL using the -load command line flag. Once our DLL is loaded, we can hijack the executable’s command line and process the flags to generate a script for re-linking the program after our modifications are done.

Retargeting LLVM IR

Ideally, we would want to write code like this and magically get it to run in our VM:

#include <Windows.h>

int main()
{
    MessageBoxA(0, "Hello from RISC-V!", "riscvm", 0);
}

Luckily, this is entirely possible: it just requires writing a (fairly) simple tool to perform transformations on the bitcode of this program (built using clang-cl). In the coming sections, we will describe how we managed to do this using Microsoft Visual Studio’s official LLVM integration (i.e. without having to use a custom fork of clang-cl).

The LLVM IR of the example above looks roughly like this (it has been cleaned up slightly for readability):

source_filename = "hello.c"
target datalayout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-windows-msvc19.38.33133"

@message = dso_local global [19 x i8] c"Hello from RISC-V!\00", align 16
@title = dso_local global [7 x i8] c"riscvm\00", align 1

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @main() #0 {
  %1 = call i32 @MessageBoxA(ptr noundef null, ptr noundef @message, ptr noundef @title, i32 noundef 0)
  ret i32 0
}

declare dllimport i32 @MessageBoxA(ptr noundef, ptr noundef, ptr noundef, i32 noundef) #1

attributes #0 = { noinline nounwind optnone uwtable "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }

!llvm.linker.options = !{!0, !0}
!llvm.module.flags = !{!1, !2, !3}
!llvm.ident = !{!4}

!0 = !{!"/DEFAULTLIB:uuid.lib"}
!1 = !{i32 1, !"wchar_size", i32 2}
!2 = !{i32 8, !"PIC Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{!"clang version 16.0.5"}

To retarget this code to RISC-V, we need to do the following:

Collect all the functions with a dllimport storage class.
Generate a riscvm_imports function that resolves all the function addresses of the imports.
Replace the dllimport functions with stubs that use riscvm_host_call to call the import.
Change the target triple to riscv64-unknown-unknown and adjust the data layout.
Compile the retargeted bitcode and link it together with crt0 to create the final payload.
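To make the first and fourth steps more concrete, here is a rough sketch that works on the textual IR (the output of llvm-dis) instead of the LLVM C++ API that the actual transpiler uses. The function names and the regular expression are assumptions made for illustration, and the sketch only handles plain declare lines like the one shown above. The data layout would need the same kind of replacement.

```python
import re
from typing import List

# Matches lines such as:
#   declare dllimport i32 @MessageBoxA(ptr noundef, ...) #1
# Quoted names (e.g. MSVC-mangled C++ symbols) are handled as well.
DLLIMPORT_DECL = re.compile(
    r'^declare\s+dllimport\b[^@\n]*@("[^"]+"|[\w.$?-]+)\s*\(',
    re.MULTILINE,
)


def collect_dllimports(ir_text: str) -> List[str]:
    """Return the names of all dllimport declarations, skipping riscvm_ helpers."""
    names = []
    for match in DLLIMPORT_DECL.finditer(ir_text):
        name = match.group(1).strip('"')
        if not name.startswith("riscvm_"):
            names.append(name)
    return names


def retarget_triple(ir_text: str, triple: str = "riscv64-unknown-unknown") -> str:
    """Replace the 'target triple' line of a textual IR module."""
    return re.sub(
        r'^target triple = ".*"$',
        f'target triple = "{triple}"',
        ir_text,
        flags=re.MULTILINE,
    )
```

Running collect_dllimports on the hello.c module above would yield a list containing only MessageBoxA, which is the input for the stub generation described later.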

Adjusting the metadata

After loading the LLVM IR Module, the first step is to change the DataLayout and the TargetTriple to be what the RISC-V backend expects:

module.setDataLayout("e-m:e-p:64:64-i64:64-i128:128-n32:64-S128");
module.setTargetTriple("riscv64-unknown-unknown");
module.setSourceFileName("transpiled.bc");

The next step is to collect all the dllimport functions for later processing. Additionally, a bunch of x86-specific function attributes are removed from every function:

std::vector<Function*> importedFunctions;
for (Function& function : module.functions())
{
	// Remove x86-specific function attributes
	function.removeFnAttr("target-cpu");
	function.removeFnAttr("target-features");
	function.removeFnAttr("tune-cpu");
	function.removeFnAttr("stack-protector-buffer-size");

	// Collect imported functions
	if (function.hasDLLImportStorageClass() && !function.getName().startswith("riscvm_"))
	{
		importedFunctions.push_back(&function);
	}
	function.setDLLStorageClass(GlobalValue::DefaultStorageClass);
}

Finally, we have to remove the llvm.linker.options metadata to make sure we can pass the IR to llc or clang without errors.

Import map

The LLVM IR only has the dllimport storage class to inform us that a function is imported. Unfortunately, it does not provide us with the DLL the function comes from. Because this information is only available at link-time (in files like user32.lib), we decided to implement an extra -importmap argument.

The extract-bc script that extracts the .llvmbc section now also has to extract the imported functions and what DLL they come from:

with open(importmap, "wb") as f:
	for desc in pe.DIRECTORY_ENTRY_IMPORT:
		dll = desc.dll.decode("utf-8")
		for imp in desc.imports:
			name = imp.name.decode("utf-8")
			f.write(f"{name}:{dll}\n".encode("utf-8"))

Currently, imports by ordinal and API sets are not supported, but we can easily make sure those do not occur when building our code.
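The resulting file contains one name:dll pair per line. The transpiler itself is written in C++, so the following Python sketch only illustrates how such a file could be consumed. The function names are hypothetical, and it assumes that neither the import name nor the DLL name contains a colon.

```python
from typing import Dict, List


def parse_importmap(text: str) -> Dict[str, str]:
    """Parse 'name:dll' lines into a mapping from import name to DLL."""
    imports: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: the first colon separates the name from the DLL
        name, sep, dll = line.partition(":")
        if not sep or not name or not dll:
            raise ValueError(f"Malformed import map line: {line!r}")
        imports[name] = dll
    return imports


def group_by_dll(imports: Dict[str, str]) -> Dict[str, List[str]]:
    """Group import names per DLL (case-insensitive), preserving order."""
    grouped: Dict[str, List[str]] = {}
    for name, dll in imports.items():
        grouped.setdefault(dll.lower(), []).append(name)
    return grouped
```

Grouping by DLL is convenient because every DLL only has to be resolved (or loaded) once before its imports are looked up.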

Creating the import stubs

For every dllimport function, we need to add some IR to riscvm_imports to resolve the address. Additionally, we have to create a stub that forwards the function arguments to riscvm_host_call. This is the generated LLVM IR for the MessageBoxA stub:

; Global variable to hold the resolved import address
@import_MessageBoxA = private global ptr null

define i32 @MessageBoxA(ptr noundef %0, ptr noundef %1, ptr noundef %2, i32 noundef %3) local_unnamed_addr #1 {
entry:
  %args = alloca ptr, i32 13, align 8
  %arg3_zext = zext i32 %3 to i64
  %arg3_cast = inttoptr i64 %arg3_zext to ptr
  %import_address = load ptr, ptr @import_MessageBoxA, align 8
  %arg0_ptr = getelementptr ptr, ptr %args, i32 0
  store ptr %0, ptr %arg0_ptr, align 8
  %arg1_ptr = getelementptr ptr, ptr %args, i32 1
  store ptr %1, ptr %arg1_ptr, align 8
  %arg2_ptr = getelementptr ptr, ptr %args, i32 2
  store ptr %2, ptr %arg2_ptr, align 8
  %arg3_ptr = getelementptr ptr, ptr %args, i32 3
  store ptr %arg3_cast, ptr %arg3_ptr, align 8
  %return = call ptr @riscvm_host_call(ptr %import_address, ptr %args)
  %return_cast = ptrtoint ptr %return to i64
  %return_trunc = trunc i64 %return_cast to i32
  ret i32 %return_trunc
}

The uint64_t args[13] array is allocated on the stack using the alloca instruction and every function argument is stored in there (after being zero-extended). The GlobalVariable named import_MessageBoxA is read and finally riscvm_host_call is executed to call the import on the host side. The return value is truncated as appropriate and returned from the stub.
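The zero-extension of the arguments and the truncation of the return value can be modeled with plain integer arithmetic. The sketch below is only an illustration of what the stub does with the values. The helper names are made up, and treating the i32 return value as signed is an assumption that matches MessageBoxA returning an int.

```python
from typing import List, Tuple

MAX_ARGS = 13  # size of the args array in the generated stub


def pack_args(args: List[Tuple[int, int]]) -> List[int]:
    """Pack (value, bit_width) pairs into 13 zero-extended 64-bit slots."""
    if len(args) > MAX_ARGS:
        raise ValueError("too many arguments for the host call")
    slots = [0] * MAX_ARGS
    for index, (value, bits) in enumerate(args):
        # Masking to the original width is equivalent to zext to i64
        slots[index] = value & ((1 << bits) - 1)
    return slots


def truncate_return(raw: int, bits: int = 32, signed: bool = True) -> int:
    """Model 'trunc i64 to iN' on the value returned by the host call."""
    value = raw & ((1 << bits) - 1)
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits
    return value
```

For example, an i32 argument of -1 ends up in its slot as 0xFFFFFFFF, with the upper 32 bits cleared.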

The LLVM IR for the generated riscvm_imports function looks like this:

; Global string for LoadLibraryA
@str_USER32.dll = private constant [11 x i8] c"USER32.dll\00"

define void @riscvm_imports() {
entry:
  %args = alloca ptr, i32 13, align 8
  %kernel32.dll_base = call ptr @riscvm_resolve_dll(i32 1399641682)
  %import_LoadLibraryA = call ptr @riscvm_resolve_import(ptr %kernel32.dll_base, i32 -550781972)
  %arg0_ptr = getelementptr ptr, ptr %args, i32 0
  store ptr @str_USER32.dll, ptr %arg0_ptr, align 8
  %USER32.dll_base = call ptr @riscvm_host_call(ptr %import_LoadLibraryA, ptr %args)
  %import_MessageBoxA = call ptr @riscvm_resolve_import(ptr %USER32.dll_base, i32 -50902915)
  store ptr %import_MessageBoxA, ptr @import_MessageBoxA, align 8
  ret void
}

The resolving itself uses the riscvm_resolve_dll and riscvm_resolve_import functions we discussed in a previous section. The final detail is that user32.dll is not loaded into every process, so we need to manually call LoadLibraryA to resolve it.

The DLL and import hashes are computed by the transpiler at compile-time instead of at runtime, which makes things a bit more annoying for an attacker to analyze.

Trade-offs

While the retargeting approach works well for simple C++ code that makes use of the Windows API, it currently does not work properly when the C/C++ standard library is used. Getting this to work properly will be difficult, but things like std::vector can be made to work with some tricks. The limitations are conceptually quite similar to driver development and we believe this is a big improvement over manually recreating types and writing manual wrappers around riscvm_host_call.

An unexplored potential area for bugs is the unverified change to the DataLayout of the LLVM module. In our tests, we did not observe any differences in structure layouts between rv64 and x64 code, but most likely there are some nasty edge cases that would need to be properly handled.

If the code written is mainly cross-platform, portable C++ with heavy use of the STL, an alternative design could be to compile most of it with a regular C++ cross-compiler and use the retargeting only for small Windows-specific parts.

One of the biggest advantages of retargeting a (mostly) regular Windows C++ program is that the payload can be fully developed and tested on Windows itself. Debugging is much more difficult once the code becomes RISC-V and our approach fully decouples the development of the payload from the VM itself.

CRT0

The final missing piece of the crt0 component is the _start function that glues everything together:

static void exit(int exit_code);
static void riscvm_relocs();
void        riscvm_imports() __attribute__((weak));
static void riscvm_init_arrays();
extern int __attribute__((noinline)) main();

// NOTE: This function has to be first in the file
void _start()
{
    riscvm_relocs();
    riscvm_imports();
    riscvm_init_arrays();
    exit(main());
    asm volatile("ebreak");
}

void riscvm_imports()
{
    // Left empty on purpose
}

The riscvm_imports function is defined as a weak symbol. This means the implementation provided in crt0.c can be overridden by linking to a stronger symbol with the same name. If we generate a riscvm_imports function in our retargeted bitcode, that implementation will be used and we can be certain it executes before main!

Example `payload` project

Now that all the necessary tooling has been described, we can put everything together in a real project! In the repository, this is all done in the payload folder. To make things easy, this is a simple cmkr project with a template to enable the retargeting scripts:

# Reference: https://build-cpp.github.io/cmkr/cmake-toml
[cmake]
version = "3.19"
cmkr-include = "cmake/cmkr.cmake"

[project]
name = "payload"
languages = ["CXX"]
cmake-before = "set(CMAKE_CONFIGURATION_TYPES Debug Release)"
include-after = ["cmake/riscvm.cmake"]
msvc-runtime = "static"

[fetch-content.phnt]
url = "https://github.com/mrexodia/phnt-single-header/releases/download/v1.2-4d1b102f/phnt.zip"

[template.riscvm]
type = "executable"
add-function = "add_riscvm_executable"

[target.payload]
type = "riscvm"
sources = [
    "src/main.cpp",
    "crt/minicrt.c",
    "crt/minicrt.cpp",
]
include-directories = [
    "include",
]
link-libraries = [
    "riscvm-crt0",
    "phnt::phnt",
]
compile-features = ["cxx_std_17"]
msvc.link-options = [
    "/INCREMENTAL:NO",
    "/DEBUG",
]

In this case, the add_executable function has been replaced with an equivalent add_riscvm_executable that creates an additional payload.bin file that can be consumed by the riscvm interpreter. The only thing we have to make sure of is to enable clang-cl when configuring the project:

cmake -B build -T ClangCL

After this, you can open build\payload.sln in Visual Studio and develop there as usual. The custom cmake/riscvm.cmake script does the following:

Enable LTO
Add the -lto-embed-bitcode linker flag
Locate clang.exe, ld.lld.exe and llvm-objcopy.exe
Compile crt0.c for the riscv64 architecture
Create a Python virtual environment with the necessary dependencies

The add_riscvm_executable function adds a custom post-build command that processes the regular output executable and runs the retargeter and relevant Python scripts to produce the riscvm artifacts:

function(add_riscvm_executable tgt)
    add_executable(${tgt} ${ARGN})
    if(MSVC)
        target_compile_definitions(${tgt} PRIVATE _NO_CRT_STDIO_INLINE)
        target_compile_options(${tgt} PRIVATE /GS- /Zc:threadSafeInit-)
    endif()
    set(BC_BASE "$<TARGET_FILE_DIR:${tgt}>/$<TARGET_FILE_BASE_NAME:${tgt}>")
    add_custom_command(TARGET ${tgt}
        POST_BUILD
        USES_TERMINAL
        COMMENT "Extracting and transpiling bitcode..."
        COMMAND "${Python3_EXECUTABLE}" "${RISCVM_DIR}/extract-bc.py" "$<TARGET_FILE:${tgt}>" -o "${BC_BASE}.bc" --importmap "${BC_BASE}.imports"
        COMMAND "${TRANSPILER}" -input "${BC_BASE}.bc" -importmap "${BC_BASE}.imports" -output "${BC_BASE}.rv64.bc"
        COMMAND "${CLANG_EXECUTABLE}" ${RV64_FLAGS} -c "${BC_BASE}.rv64.bc" -o "${BC_BASE}.rv64.o"
        COMMAND "${LLD_EXECUTABLE}" -o "${BC_BASE}.elf" --oformat=elf -emit-relocs -T "${RISCVM_DIR}/lib/linker.ld" "--Map=${BC_BASE}.map" "${CRT0_OBJ}" "${BC_BASE}.rv64.o"
        COMMAND "${OBJCOPY_EXECUTABLE}" -O binary "${BC_BASE}.elf" "${BC_BASE}.pre.bin"
        COMMAND "${Python3_EXECUTABLE}" "${RISCVM_DIR}/relocs.py" "${BC_BASE}.elf" --binary "${BC_BASE}.pre.bin" --output "${BC_BASE}.bin"
        COMMAND "${Python3_EXECUTABLE}" "${RISCVM_DIR}/encrypt.py" --encrypt --shuffle --map "${BC_BASE}.map" --shuffle-map "${RISCVM_DIR}/shuffled_opcodes.json" --opcodes-map "${RISCVM_DIR}/opcodes.json" --output "${BC_BASE}.enc.bin" "${BC_BASE}.bin"
        VERBATIM
    )
endfunction()

While all of this is quite complex, we did our best to make it as transparent to the end-user as possible. After enabling Visual Studio’s LLVM support in the installer, you can start developing VM payloads in a few minutes. You can get a precompiled transpiler binary from the releases.

Debugging in `riscvm`

When debugging the payload, it is easiest to load payload.elf in Ghidra to see the instructions. Additionally, the debug builds of the riscvm executable have a --trace flag to enable instruction tracing. The execution of main in the MessageBoxA example looks something like this (labels added manually for clarity):

                      main:
0x000000014000d3a4:   addi     sp, sp, -0x10 = 0x14002cfd0
0x000000014000d3a8:   sd       ra, 0x8(sp) = 0x14000d018
0x000000014000d3ac:   auipc    a0, 0x0 = 0x14000d4e4
0x000000014000d3b0:   addi     a1, a0, 0xd6 = 0x14000d482
0x000000014000d3b4:   auipc    a0, 0x0 = 0x14000d3ac
0x000000014000d3b8:   addi     a2, a0, 0xc7 = 0x14000d47b
0x000000014000d3bc:   addi     a0, zero, 0x0 = 0x0
0x000000014000d3c0:   addi     a3, zero, 0x0 = 0x0
0x000000014000d3c4:   jal      ra, 0x14 -> 0x14000d3d8
                        MessageBoxA:
0x000000014000d3d8:     addi     sp, sp, -0x70 = 0x14002cf60
0x000000014000d3dc:     sd       ra, 0x68(sp) = 0x14000d3c8
0x000000014000d3e0:     slli     a3, a3, 0x0 = 0x0
0x000000014000d3e4:     srli     a4, a3, 0x0 = 0x0
0x000000014000d3e8:     auipc    a3, 0x0 = 0x0
0x000000014000d3ec:     ld       a3, 0x108(a3=>0x14000d4f0) = 0x7ffb3c23a000
0x000000014000d3f0:     sd       a0, 0x0(sp) = 0x0
0x000000014000d3f4:     sd       a1, 0x8(sp) = 0x14000d482
0x000000014000d3f8:     sd       a2, 0x10(sp) = 0x14000d47b
0x000000014000d3fc:     sd       a4, 0x18(sp) = 0x0
0x000000014000d400:     addi     a1, sp, 0x0 = 0x14002cf60
0x000000014000d404:     addi     a0, a3, 0x0 = 0x7ffb3c23a000
0x000000014000d408:     jal      ra, -0x3cc -> 0x14000d03c
                          riscvm_host_call:
0x000000014000d03c:       lui      a2, 0x5 = 0x14000d47b
0x000000014000d040:       addiw    a7, a2, -0x1e0 = 0x4e20
0x000000014000d044:       ecall    0x4e20
0x000000014000d048:       ret      (0x14000d40c)
0x000000014000d40c:     ld       ra, 0x68(sp=>0x14002cfc8) = 0x14000d3c8
0x000000014000d410:     addi     sp, sp, 0x70 = 0x14002cfd0
0x000000014000d414:     ret      (0x14000d3c8)
0x000000014000d3c8:   addi     a0, zero, 0x0 = 0x0
0x000000014000d3cc:   ld       ra, 0x8(sp=>0x14002cfd8) = 0x14000d018
0x000000014000d3d0:   addi     sp, sp, 0x10 = 0x14002cfe0
0x000000014000d3d4:   ret      (0x14000d018)
0x000000014000d018: jal      ra, 0x14 -> 0x14000d02c
                      exit:
0x000000014000d02c:   lui      a1, 0x2 = 0x14002cf60
0x000000014000d030:   addiw    a7, a1, 0x710 = 0x2710
0x000000014000d034:   ecall    0x2710

The tracing also uses the enums for the opcodes, so it works with shuffled and encrypted payloads as well.

Outro

Hopefully this article has been an interesting read for you. We tried to walk you through the process in the same order we developed it in, but you can always refer to the riscy-business GitHub repository and try things out for yourself if you get confused along the way. If you have any ideas for improvements, or would like to discuss them, you are always welcome in our Discord server!

We would like to thank the following people for proofreading and discussing the design and implementation with us (alphabetical order):

Brit
herrcore
JustMagic
Renegade
veritas

Additionally, we highly appreciate the open source projects that we built this project on! If you use this project, consider giving back your improvements to the community as well.

Merry Christmas!