Assemblers

UNDER CONSTRUCTION

This article describes popular assemblers and their architecture-specific differences.

Assemblers

GCC generates assembly code and invokes GNU Assembler (also known as "gas"), which is part of GNU Binutils, to convert the assembly code into machine code. The GCC driver is also capable of accepting assembly input files. Due to GCC's widespread usage, GNU Assembler is arguably the most popular assembler.

Within the LLVM project, the LLVM integrated assembler is a library that is linked by Clang, llvm-mc, and lld (for LTO purposes) to generate machine code. It supports a wide range of GNU Assembler syntax and can be used as a drop-in replacement for GNU Assembler.

On the Windows platform, the Microsoft Macro Assembler (MASM) is widely used.

For x86 architecture, NASM is another popular assembler.

Architectures

x86

There are two main branches of syntax: Intel syntax and AT&T syntax. AT&T syntax is derived from PDP-11 assembly syntax and differs from Intel syntax in several key ways:

The operand list is reversed compared to Intel syntax.
The four-part generic addressing mode is written as displacement(base,index,scale) instead of [base+index*scale+disp] in Intel syntax.
Immediate values are prefixed with $, while registers are prefixed with %.
The mnemonics have a suffix indicating the operand size, e.g. b for 1 byte, w for 2 bytes (Word), l for 4 bytes (Long), and q for 8 bytes (Quad).

Although the sigils add some complexity to the language, they do provide a distinct advantage: symbol references can be parsed without ambiguity. Many x86 instructions take an operand that can be a register or a memory location. With sigils, parsing becomes unambiguous, as demonstrated by examples such as subl var, %eax and subl $1, %eax.

% gcc -S a.c
% cat a.s
...
        movl    var(%rip), %eax
        addl    $3, %eax
% gcc -S -masm=intel a.c
% cat a.s
...
        .intel_syntax noprefix
...
        mov     eax, DWORD PTR var[rip]
        add     eax, 3
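
The addressing-mode difference can be sketched in a few lines of Python. This is only an illustration, not how any assembler parses operands: att_mem_to_intel is a hypothetical helper, it handles just the displacement(base,index,scale) form, and it ignores segment overrides and size directives such as DWORD PTR. Note that it prints the generic [base+index*scale+disp] form, whereas GCC prints var[rip] for symbol references.

```python
import re

# Hypothetical helper: matches disp(base,index,scale); every part is optional.
_ATT_MEM = re.compile(
    r"^(?P<disp>[^()]*)"
    r"\((?P<base>%\w+)?(?:,(?P<index>%\w+)(?:,(?P<scale>[1248]))?)?\)$"
)


def att_mem_to_intel(operand):
    """Rewrite an AT&T memory operand as an Intel-style bracket expression."""
    m = _ATT_MEM.match(operand.replace(" ", ""))
    if m is None:
        raise ValueError(f"not an AT&T memory operand: {operand!r}")
    parts = []
    if m["base"]:
        parts.append(m["base"].lstrip("%"))  # drop the register sigil
    if m["index"]:
        parts.append(f"{m['index'].lstrip('%')}*{m['scale'] or '1'}")
    expr = "+".join(parts)
    disp = m["disp"]
    if disp:
        # A negative displacement already carries its own sign.
        if expr and not disp.startswith("-"):
            expr += "+"
        expr += disp
    return f"[{expr}]"


for op in ["8(%rax,%rbx,4)", "var(%rip)", "-16(%rbp)", "(,%rbx,8)"]:
    print(op, "->", att_mem_to_intel(op))
```
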

Intel syntax is generally concise, except for the verbose size directives (e.g., DWORD PTR). It is widely utilized in the Windows environment and within the reverse engineering community.

However, Intel syntax has a flaw related to ambiguity, as it prevents the use of variable names that collide with registers (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53929).

% cat ambiguous.c
int *rip, rax;
int foo() { return rip[rax]; }
% gcc -S -masm=intel ambiguous.c -o -
...
        mov     rax, QWORD PTR rip[rip]
        mov     edx, DWORD PTR rax[rip]
% gcc -c -masm=intel ambiguous.c
/tmp/ccEOMwm6.s: Assembler messages:
/tmp/ccEOMwm6.s:28: Error: invalid use of register
/tmp/ccEOMwm6.s:29: Error: invalid use of register
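
A build script could flag such identifiers before they reach the assembler. The sketch below is a hypothetical helper, not part of GCC or binutils. Its register list is deliberately partial (64-bit and 32-bit general-purpose registers plus rip), and matching names case-insensitively is my assumption to stay on the conservative side.

```python
# Partial, illustrative register list; a real check would need the full set
# (16-bit and 8-bit names, segment, vector and mask registers, and so on).
_GPR64 = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp"}
_GPR32 = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}
_NUMBERED = {f"r{n}" for n in range(8, 16)}
X86_64_REGISTER_NAMES = _GPR64 | _GPR32 | _NUMBERED | {"rip"}


def collides_with_register(identifier):
    """Return True if a C identifier would read as a register in Intel syntax."""
    # Assumption: compare case-insensitively to be conservative.
    return identifier.lower() in X86_64_REGISTER_NAMES


for name in ["rip", "rax", "var"]:
    print(name, collides_with_register(name))
```
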

I believe it would be beneficial if the designers added sigils to Intel syntax to disambiguate symbol references from registers. The absence of AT&T-style line noise makes Intel syntax code much more readable. Unfortunately, Intel syntax is less common in source code because GCC defaults to AT&T syntax. (Please, really, make -masm=intel the default for x86.)

Using as -msyntax=intel -mnaked-reg allows parsing the input in Intel syntax without a register prefix. This is similar to including a .intel_syntax noprefix directive in the input.

With llvm-mc -x86-asm-syntax=intel, the input can be parsed in Intel syntax. Using -output-asm-variant=1 will print instructions in Intel syntax.

MIPS

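Modifiers are used to describe different access types of a symbol. As a bonus, this prevents symbol references from being mistaken for register names. However, the function call-like syntax can appear verbose. Note that the example below uses RISC-V register names (a0, tp) and the RISC-V %tprel_add modifier; RISC-V copied this modifier syntax from MIPS, where registers are written with a $ prefix.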

lui     a0, %tprel_hi(tls)
add     a0, a0, tp, %tprel_add(tls)
lw      a0, %tprel_lo(tls)(a0)
lui     a1, %hi(var)
lw      a2, %lo(var)(a1)
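
The arithmetic behind a %hi/%lo pair is easy to get wrong, so here is a small Python sketch of it. The low part is a signed immediate, so the high part has to be rounded up whenever the low part is negative. split_hi_lo and rebuild are hypothetical names. The low part is 16 bits wide on MIPS and 12 bits wide for a RISC-V lui/lw pair like the one above, and the sketch assumes 32-bit addresses.

```python
def split_hi_lo(addr, lo_bits):
    """Split a 32-bit address into the (hi, lo) values used by %hi and %lo."""
    lo = addr & ((1 << lo_bits) - 1)
    if lo >= 1 << (lo_bits - 1):
        lo -= 1 << lo_bits  # the low immediate is sign-extended by the CPU
    # Compensate for a negative lo by rounding the high part up.
    hi = ((addr - lo) >> lo_bits) & ((1 << (32 - lo_bits)) - 1)
    return hi, lo


def rebuild(hi, lo, lo_bits):
    """What lui followed by a load/add with the low immediate computes."""
    return ((hi << lo_bits) + lo) & 0xFFFFFFFF


addr = 0x12348765
for lo_bits in (16, 12):  # 16: MIPS, 12: RISC-V
    hi, lo = split_hi_lo(addr, lo_bits)
    print(lo_bits, hex(hi), lo, hex(rebuild(hi, lo, lo_bits)))
```
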

Power ISA

Power ISA assembly may seem unusual, as general-purpose registers are written as bare numbers without an r prefix. Whether an integer denotes a register or an immediate value depends on its position as an operand in an instruction. I find that this slightly hurts readability.

Similar to x86, postfix modifiers are used to describe different access kinds of a symbol.

AArch64

Prefix modifiers are used to describe various access types of a symbol. Personally, this is the modifier syntax that I prefer the most.

add     x8, x8, :tprel_hi12:tls
add     x8, x8, :tprel_lo12_nc:tls
adrp    x8, fp
ldr     x8, [x8, :lo12:fp]
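
To make the adrp and :lo12: pair concrete, here is a rough Python model of the address arithmetic. adrp materializes the 4KiB page of the symbol relative to the page of the PC, and :lo12: supplies the offset within that page. The function names and the sample addresses are made up for illustration.

```python
PAGE_MASK = 0xFFF  # adrp works on 4KiB pages


def adrp_and_lo12(pc, sym):
    """Return the signed page delta for adrp and the :lo12: offset of sym."""
    page_delta = ((sym & ~PAGE_MASK) - (pc & ~PAGE_MASK)) >> 12
    lo12 = sym & PAGE_MASK
    return page_delta, lo12


def resolve(pc, page_delta, lo12):
    """Address reached by adrp followed by an access using :lo12:."""
    return (pc & ~PAGE_MASK) + (page_delta << 12) + lo12


pc, sym = 0x400123, 0x412ABC  # made-up addresses
delta, lo12 = adrp_and_lo12(pc, sym)
print(delta, hex(lo12), hex(resolve(pc, delta, lo12)))
```
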

RISC-V

The modifier syntax is copied from MIPS.

The documentation is available at https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md.

Inline assembly

Certain compilers allow the inclusion of assembly code within a high-level language.

The most widely used implementations are GCC's Basic Asm and Extended Asm. On Windows, MSVC supports inline assembly for x86-32 but not for x86-64 or Arm.

Clang supports both GCC and MSVC inline assembly. Clang's MSVC inline assembly can be utilized with x86-64.

Some compilers provide additional variants of inline assembly. Here are some relevant links:

Notes on GNU Assembler

.file and .loc directives are used to create .debug_line.

.cfi* directives are used to create .eh_frame or .debug_frame.

GNU Assembler implements "INDEFINITE REPEAT BLOCK DIRECTIVES: .IRP AND .IRPC" from MACRO-11. Unfortunately, there is no directive equivalent to a counted loop such as for (int i = 0; i < 20; i++). .irpc i,0123456789 gives only 10 iterations, and writing out all the integers with .irp is tedious and error-prone.

.rept 3
  ret
.endr

.irpc i,012
  movq $\i, %rax
.endr

.irp i,%rax,%rbx,%rcx
  movq \i, %rax
.endr
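
One way to avoid typing a long .irp argument list by hand is to generate it. The following is a minimal sketch, assuming it is acceptable to run a script before assembling; irp_block is a hypothetical helper name, and the body line is taken from the .irpc example above.

```python
def irp_block(var, values, body):
    """Render an .irp ... .endr block that iterates over the given values."""
    lines = [f".irp {var}," + ",".join(str(v) for v in values)]
    lines += ["  " + line for line in body]
    lines.append(".endr")
    return "\n".join(lines)


# Emulates for (int i = 0; i < 20; i++) with 20 comma-separated operands.
print(irp_block("i", range(20), [r"movq $\i, %rax"]))
```

The printed text can be pasted into an assembly file or written to a file that the main source pulls in.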

.if, .ifdef, and .ifndef directives allow us to write conditional code in assembly tests without using a C preprocessor. I often use .ifdef to combine positive tests and negative tests in one file.

# RUN: llvm-mc %s | FileCheck %s
# RUN: not llvm-mc --defsym ERR=1 %s -o /dev/null 2>&1 | FileCheck %s --check-prefix=ERR

# CHECK: ...
## positive tests

.ifdef ERR
# ERR: ...
## negative tests
.endif

GNU Assembler has supported .incbin since 2001-07 (hey, C/C++ #embed). The review thread mentioned that .incbin had been supported by some other assemblers.

Notes on LLVM integrated assembler

In general, inline assembly is parsed by LLVMMCParser for validation and formatting purposes. Some targets disable this parsing by default, and it can be disabled explicitly with the -fno-integrated-as option.

Let's focus on ELF platforms for the following description, assuming our goal is to create a relocatable object file. The input file can be either LLVM IR (intermediate code; the initial input file may be in C/C++) or assembly language.

If the input is LLVM IR, LLVM creates an MCObjectStreamer object with new MCELFStreamer or a target-registered factory (e.g., AArch64ELFStreamer). The streamer constructor creates an MCAssembler object. For an assembly input file, LLVM additionally creates an MCAsmParser object and an MCTargetAsmParser object.

MSVC inline assembly

TODO

__asm blocks are parsed for Windows target triples. This extension is available on other targets by specifying -fasm-blocks or the broad -fms-extensions. An __asm statement is represented as a clang::MSAsmStmt object. clang::Parser::ParseMicrosoftAsmStatement parses the inline assembly string and calls llvm::AsmParser::parseMSInlineAsm. It is worth noting that the string may be modified during this process. For a clang::MSAsmStmt object, LLVM IR is generated through clang::CodeGen::CodeGenFunction::EmitAsmStmt.