Symbol address
In an executable or shared object (called a component in ELF), a text section may need the absolute virtual address of a symbol (e.g. a function or a variable). The reference arises from an address-taking operation or a PLT entry. The address may be:
- a link-time constant
- the load base plus a link-time constant
- dependent on runtime computation by ld.so
Link-time constant
For the first case, this component must be a position-dependent executable: a link-time address equals its virtual address at runtime. The text section can hold the absolute virtual address directly or use PC-relative addressing.
# i386
(With an FDPIC ABI on MMU-less Linux, the compiler may add an offset to the FDPIC register instead.)
Load base plus constant
For the second case, this component is either a position-independent executable or a shared object. The difference between the link-time addresses of two symbols equals their virtual address difference at runtime. The first byte of the program image, the ELF header, is loaded at the load base. The text section can get the current program counter, then add the distance from PC to the symbol (PC-relative address), to compute the runtime virtual address.
# x86_64
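The invariant behind the second case can be written down as plain arithmetic. This is a minimal sketch, not toolchain code: `runtime_address` and `pc_relative_address` are hypothetical helper names, and the addresses are treated as plain integers.

```c
#include <stdint.h>

/* Hypothetical helper: runtime virtual address of a symbol in a PIE or
   shared object, given the load base chosen by the loader. */
uint64_t runtime_address(uint64_t load_base, uint64_t link_time_address) {
  return load_base + link_time_address;
}

/* The distance between two link-time addresses is preserved at runtime, so
   code that knows its own runtime PC can reach the symbol without knowing
   the load base. */
uint64_t pc_relative_address(uint64_t runtime_pc, uint64_t link_time_pc,
                             uint64_t link_time_symbol) {
  return runtime_pc + (link_time_symbol - link_time_pc);
}
```

Both helpers give the same answer for any load base, which is why PC-relative code needs no dynamic relocation here.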
Runtime computation by ld.so
For the third case, we need help from the runtime loader (abbreviated as ld.so). The linker emits a dynamic relocation to let the runtime loader perform a symbol lookup to determine the associated symbol value at runtime.
The symbol is either potentially defined in another component or is an STT_GNU_IFUNC symbol. See GNU indirect function for STT_GNU_IFUNC.
If the text section holds the address which is relocated by the dynamic relocation, this is called a text relocation.
More commonly, the address is stored in the Global Offset Table (abbreviated as GOT). The compiler emits code which uses position-independent addressing to extract the absolute virtual address from GOT. The relocations (e.g. R_AARCH64_ADR_GOT_PAGE, R_X86_64_REX_GOTPCRELX) are called GOT-generating. The linker will create entries in the Global Offset Table.
# aarch64
Global Offset Table
The Global Offset Table (usually consisting of .got and .got.plt) holds the symbol addresses which are referenced by text sections. The table holds link-time constant entries and entries which are relocated by a dynamic relocation.
Why do we need a GOT entry for a link-time constant? Well, at compile time it is probably undecided whether the entry may resolve to another component. The compiler may emit a GOT-generating relocation and use an indirection in a conservative manner. At link time the linker may find that the value is a constant.
Life of a .got.plt entry
TODO: link to my future article about PLT.
Life of a .got entry
Compiler behavior
Defined symbols
Defined symbols generally belong to the first and second cases. However, on ELF, a non-local default visibility symbol in a shared object is preemptible by default. For -fpic code, the third case is used: since such a definition may be interposed by another definition at runtime, the compiler conservatively uses GOT indirection.
int var;
# -fno-pic or -fpie
Using the C/C++ internal linkage (static, unnamed namespace) or protected/hidden visibility can avoid the indirection for -fpic.
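The three kinds of definition can be compared side by side. The snippet below is a small sketch with hypothetical variable and function names; the comments describe what GCC and Clang typically emit for x86-64 with -fpic, which can be checked with something like `gcc -fpic -O2 -S`.

```c
/* Default visibility: preemptible in a shared object, so -fpic code
   typically loads the address from the GOT first. */
int preemptible_var = 1;

/* Hidden visibility: bound within this component, so the access can be
   PC-relative. */
__attribute__((visibility("hidden"))) int hidden_var = 2;

/* Internal linkage: never visible to other components. */
static int internal_var = 3;

int get_preemptible(void) { return preemptible_var; }
int get_hidden(void) { return hidden_var; }
int get_internal(void) { return internal_var; }

/* A writer keeps internal_var from being folded into a constant. */
void set_internal(int value) { internal_var = value; }
```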
See Copy relocations, canonical PLT entries and protected visibility for why GCC protected data uses (unneeded) indirection.
Undefined symbols
If the symbol has the default visibility, the definition may be in a different component. For position-independent code (-fpie and -fpic), the compiler uses GOT indirection conservatively.
extern int ext_var;
movq ext_var@GOTPCREL(%rip), %rax
For position-dependent code (-fno-pic), traditionally the compiler optimizes for statically linked executables and uses direct addressing (usually absolute relocations). How does it work if the symbol is actually defined in a shared object? To avoid text relocations, there are copy relocations and canonical PLT entries. They essentially change the third case (symbol lookup) into the first two cases. See Copy relocations, canonical PLT entries and protected visibility for details.
If the symbol has a non-default visibility, the symbol must be defined in the component. The compiler can safely assume the address is either a link-time constant or the load base plus a constant.
__attribute__((visibility("hidden")))
movl ext_var(%rip), %eax
Linker behavior
A GOT-generating relocation references a symbol. When the linker sees such a referenced symbol for the first time, it reserves an entry in GOT. For subsequent GOT-generating relocations referencing the same symbol, the linker just reuses this entry. The address of the GOT entry is insignificant.
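The reserve-once, reuse-afterwards behavior can be sketched with a tiny table keyed by symbol name. `struct got_table` and `got_slot_for` are hypothetical names for illustration only; a real linker would use a hash map and per-symbol flags rather than a linear scan.

```c
#include <stddef.h>
#include <string.h>

#define MAX_GOT_ENTRIES 64

/* Hypothetical linker-side state: one GOT slot per distinct symbol. */
struct got_table {
  const char *symbols[MAX_GOT_ENTRIES];
  size_t count;
};

/* Return the slot index for sym, reserving a new slot on first reference.
   Returns -1 if the table is full. */
long got_slot_for(struct got_table *got, const char *sym) {
  for (size_t i = 0; i < got->count; i++)
    if (strcmp(got->symbols[i], sym) == 0)
      return (long)i; /* subsequent references reuse the entry */
  if (got->count == MAX_GOT_ENTRIES)
    return -1;
  got->symbols[got->count] = sym;
  return (long)got->count++;
}
```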
Technically the linker can use multiple entries for one symbol. It just wastes space for the majority of cases, but some awful ABIs do use multi-GOT, e.g. mips and ppc32.
The entry needs a dynamic relocation or is a link-time constant.
if (preemptible)
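One way to read the three cases from the beginning of the article as a linker decision is sketched below. The enum and function names are hypothetical, and the sketch deliberately ignores STT_GNU_IFUNC, absolute symbols and TLS, all of which need extra handling in a real linker.

```c
#include <stdbool.h>

enum got_entry_kind {
  GOT_GLOB_DAT, /* dynamic relocation with a symbol lookup (third case) */
  GOT_RELATIVE, /* load base plus a constant (second case) */
  GOT_CONSTANT  /* link-time constant (first case) */
};

/* pic_output is true for -pie and -shared links. */
enum got_entry_kind classify_got_entry(bool preemptible, bool pic_output) {
  if (preemptible)
    return GOT_GLOB_DAT;
  if (pic_output)
    return GOT_RELATIVE;
  return GOT_CONSTANT;
}
```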
ld.so behavior
An R_*_GLOB_DAT relocation is identical to an absolute relocation of the word size (e.g. R_AARCH64_ABS64, R_X86_64_64). ld.so performs a symbol lookup and fills the location with the virtual address.
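The lookup-and-fill step can be simulated without a real loader. In this sketch the structures and the function are hypothetical stand-ins: symbols are matched by name in search order, and the first definition found wins, which is also how interposition falls out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flattened view of all definitions, in lookup order. */
struct sym_def { const char *name; uint64_t address; };

/* Hypothetical simplified relocation record: which GOT slot, which symbol. */
struct glob_dat { size_t got_index; const char *name; };

/* Fill each GOT slot with the address of the first matching definition.
   Returns the number of relocations that could not be resolved. */
size_t apply_glob_dat(uint64_t *got, const struct glob_dat *rels, size_t nrels,
                      const struct sym_def *defs, size_t ndefs) {
  size_t unresolved = 0;
  for (size_t i = 0; i < nrels; i++) {
    size_t j = 0;
    while (j < ndefs && strcmp(defs[j].name, rels[i].name) != 0)
      j++;
    if (j == ndefs)
      unresolved++;
    else
      got[rels[i].got_index] = defs[j].address;
  }
  return unresolved;
}
```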
GOT optimization
GOT indirection to PC-relative
When the symbol associated with a GOT entry is non-preemptible, the third case effectively becomes the first or the second case. The code sequence nevertheless has a load from the GOT entry. Why don't we optimize the code sequence?
Some psABI (Processor Specific Application Binary Interface) documents do define such an optimization.
For example, x86-64's R_X86_64_REX_GOTPCRELX optimization does the following transformation:
# input
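To make the rewriting concrete, here is a rough sketch of one form of this relaxation: turning `movq sym@GOTPCREL(%rip), %reg` into `leaq sym(%rip), %reg` by changing the opcode byte from 0x8b to 0x8d and retargeting the displacement at the symbol itself. The function name and its parameters are hypothetical, and only this one instruction form is handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* insn points at the 7-byte instruction: REX prefix, opcode, ModRM, disp32.
   insn_addr and sym_addr are addresses in the output image. */
bool relax_gotpcrelx_mov(uint8_t *insn, uint64_t insn_addr, uint64_t sym_addr,
                         bool preemptible) {
  if (preemptible)
    return false; /* the GOT load must stay */
  if ((insn[0] & 0xfb) != 0x48 || insn[1] != 0x8b)
    return false; /* not REX.W mov r64, r/m64 */
  if ((insn[2] & 0xc7) != 0x05)
    return false; /* operand is not RIP-relative */
  /* RIP-relative displacements count from the end of the instruction. */
  int64_t disp = (int64_t)(sym_addr - (insn_addr + 7));
  if (disp < INT32_MIN || disp > INT32_MAX)
    return false; /* symbol out of reach for a 32-bit displacement */
  insn[1] = 0x8d; /* mov becomes lea */
  for (int i = 0; i < 4; i++) /* store disp32 in little-endian order */
    insn[3 + i] = (uint8_t)((uint32_t)disp >> (8 * i));
  return true;
}
```

The bail-out paths matter as much as the rewrite: when any check fails, the original GOT load is left untouched and remains correct.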
PowerPC64 ELFv2's TOC-indirect to TOC-relative optimization:
# input
On Mach-O, ld64's arm64 port defines some GOT optimization as well.
.globl _main
For a regular adrp+ldr+ldr code sequence loading the value of a variable through GOT indirection, either the first two instructions (adrp+ldr) can be optimized (computing the GOT address by PC-relative), or the three instructions can be optimized as a whole (load the variable directly via LDR (literal)).
Combining .got and .got.plt
The x86-64 psABI defines another optimization: if a symbol needs both a .got entry (R_X86_64_GLOB_DAT; address taking) and a .got.plt entry (R_X86_64_JUMP_SLOT), the two entries can be combined into one. In GNU ld, the new entry is added to .plt.got.
LLD does not implement this optimization: https://bugs.llvm.org/show_bug.cgi?id=32938. I think the optimization has low value but high linker complexity.
DT_MIPS_LOCAL_GOTNO and DT_MIPS_SYMTABNO-DT_MIPS_GOTSYM
As stated previously, some GOT entries are for non-preemptible symbols. For -pie and -shared links, they need relative relocations. Recording R_MIPS_RELATIVE relocations is a bit expensive, so mips optimizes them out by reordering such GOT entries to the start. The linker emits DT_MIPS_LOCAL_GOTNO, and ld.so applies relative relocation operations to the first DT_MIPS_LOCAL_GOTNO GOT entries.
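The implicit relative relocation amounts to a loop over the leading GOT entries. The following is a sketch under assumptions: `relocate_local_got` is a hypothetical name, and the `first` parameter is my own addition to let a caller skip whatever reserved entries sit at the very start of the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the load base to GOT entries [first, local_gotno). No relocation
   records are consulted: the count alone tells the loader what to do. */
void relocate_local_got(uint64_t *got, size_t first, size_t local_gotno,
                        uint64_t load_base) {
  for (size_t i = first; i < local_gotno; i++)
    got[i] += load_base;
}
```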
A regular REL format relocation costs 2 words. mips does micro optimization here again by using just one word for DT_MIPS_SYMTABNO-DT_MIPS_GOTSYM GOT entries which are otherwise relocated by R_MIPS_JUMP_SLOT.
Hey, this seems clever, doesn't it? No, it's awful.
There is a more useful technique which can speed up symbol lookup: DT_GNU_HASH. Both mips and DT_GNU_HASH sort the dynamic symbol table, but in a different way, so DT_GNU_HASH is incompatible on mips. To overcome this shortcoming, some folks added DT_MIPS_XHASH support to binutils and glibc. Their scheme adds another table to the GNU hash table, giving back some space they saved.
Sorry to be blunt, but let me add more arguments why mips was shortsighted. Relative relocations have a much better size-saving technique: DT_RELR. If a GOT optimization technique like R_X86_64_REX_GOTPCRELX is used, many non-preemptible GOT entries will not be needed at all.
If someone tries to add DT_MIPS_XHASH support to LLVM, I'll definitely object.
For future architectures, GOT optimization is somewhat useful. When designing relocation types, make sure GOT optimization can be retroactively added.
The aarch64 ABI is trying to add GOT optimization. Adding new relocation types requires bleeding-edge toolchain support, while overloading old GOT-generating relocations requires care with the semantics. Instruction rewriting can easily break the program if done carelessly.
More about the linker-loader protocol
_GLOBAL_OFFSET_TABLE_
GNU ld defines the symbol relative to the Global Offset Table.
- The aarch64, arm, mips, ppc, and riscv ports define the symbol at the start of .got.
- The x86 port defines the symbol at the start of .got.plt.
Code can use the symbol to access GOT entries.
IMO only ancient (badly designed) architectures reference _GLOBAL_OFFSET_TABLE_ directly. Modern architectures use operand modifiers.
# i386
With GOT optimization, a GOT entry can be suppressed. If _GLOBAL_OFFSET_TABLE_ is referenced directly, the linker needs to define it even if it is otherwise unused.
_GLOBAL_OFFSET_TABLE_[0]
So GNU ld and glibc introduced more hacks in the dark age. In nearly every port, _GLOBAL_OFFSET_TABLE_[0] is the link-time address of _DYNAMIC (the start of .dynamic/PT_DYNAMIC). dl-machine.h files use this approach to compute the load base (the virtual address of the ELF header).
runtime_DYNAMIC = PC relative address of _DYNAMIC
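The arithmetic behind this trick is a single subtraction. The helper below is a hypothetical illustration with addresses passed as plain integers; real dl-machine.h code obtains the first operand with a PC-relative instruction sequence and the second by reading the GOT.

```c
#include <stdint.h>

/* runtime_dynamic: runtime address of _DYNAMIC (obtained PC-relatively).
   got0: the value stored in _GLOBAL_OFFSET_TABLE_[0], i.e. the link-time
   address of _DYNAMIC. Their difference is the load base. */
uint64_t load_base_from_dynamic(uint64_t runtime_dynamic, uint64_t got0) {
  return runtime_dynamic - got0;
}
```

For a position-dependent executable the two values coincide and the load base comes out as zero.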
In 2012, GNU ld and gold (included in binutils 2.23) started to define __ehdr_start, which has the link-time address zero. Using a PC-relative code sequence to take the runtime address of __ehdr_start gives us a better way to get the load base. I submitted patches to use the approach for aarch64/arm/riscv/x86_64. The changes will be included in glibc 2.35.
.got.plt[1...2]
TODO: link to my future article about PLT.
DT_PLTGOT is defined as the address of .got.plt.
The linker reserves the first 3 entries of .got.plt. .plt usually starts with a header which calls .got.plt[2] with an argument .got.plt[1] and other arch-specific arguments.
ld.so puts a descriptor into .got.plt[1] and the address of the lazy PLT resolver into .got.plt[2]. The lazy PLT resolver identifies the caller object with the descriptor and uses other arguments to figure out the to-be-called function.
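The effect of lazy binding on a single slot can be simulated with a function pointer. This is a toy model with hypothetical names: the real resolver receives the descriptor from .got.plt[1] and arch-specific arguments and performs a symbol lookup, whereas here the target is hard-coded.

```c
typedef int (*func_t)(int);

/* The "real" callee that a symbol lookup would find. */
static int square(int x) { return x * x; }

static int lazy_stub(int x);

/* One simulated .got.plt slot; it initially points at the lazy stub. */
static func_t got_plt_slot = lazy_stub;
static int resolve_count;

/* Stand-in for the lazy PLT resolver: resolve once, patch the slot, then
   forward the call. */
static int lazy_stub(int x) {
  resolve_count++;
  got_plt_slot = square;
  return got_plt_slot(x);
}

/* Stand-in for a PLT entry: always call through the slot. */
int call_through_plt(int x) { return got_plt_slot(x); }

int times_resolved(void) { return resolve_count; }
```

After the first call the slot holds the final address, so later calls skip the resolver entirely.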
PT_GNU_RELRO
.got and .got.plt have the SHF_WRITE flag. Traditionally they are always writable, which is considered bad from the security perspective. GNU invented the PT_GNU_RELRO program header.
The idea is that .got only contains relocations which should be eagerly resolved. With -z relro, the linker places .got into PT_GNU_RELRO. At runtime, after ld.so has resolved relocations for an object, it calls mprotect(relro_start, relro_size, PROT_READ) to mark the .got region read-only. This is sometimes called "partial RELRO".
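Because mprotect works on whole pages, the loader has to turn the PT_GNU_RELRO range into a page range first. The sketch below shows one way to do that; the names are hypothetical, and rounding both ends down to a page boundary is an assumption made here so that nothing past the end of the region is made read-only.

```c
#include <stdint.h>

struct page_range { uint64_t start; uint64_t end; };

/* relro_vaddr and relro_memsz stand for p_vaddr and p_memsz of the
   PT_GNU_RELRO program header; page_size must be a power of two. */
struct page_range relro_pages(uint64_t load_base, uint64_t relro_vaddr,
                              uint64_t relro_memsz, uint64_t page_size) {
  uint64_t mask = ~(page_size - 1);
  struct page_range r;
  r.start = (load_base + relro_vaddr) & mask;
  r.end = (load_base + relro_vaddr + relro_memsz) & mask;
  return r;
}
```

If start and end differ, the range would then be passed to mprotect as (start, end - start, PROT_READ).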
(I reported https://sourceware.org/bugzilla/show_bug.cgi?id=24769 that GNU ld's riscv port doesn't implement partial RELRO correctly.)
With -z relro -z now, the linker additionally places .got.plt into PT_GNU_RELRO. At runtime, ld.so resolves .got.plt relocations eagerly and then calls mprotect. This scheme disables lazy binding PLT. It is sometimes called "full RELRO". When the program has many R_*_JUMP_SLOT relocations, there may be significant startup slowdown.
Non-address GOT entries
GOT has some reserved entries at the start of .got and .got.plt. Most remaining entries are symbol addresses. The rest are tls_index objects (module ID and offset from dtv[m] to the symbol for general-dynamic/local-dynamic TLS models), TLS descriptors, and TP offsets.
PowerPC64 ELFv2 TOC
TODO: Move this to a future PowerPC64 article.
Somehow PowerPC64 ELFv2 decided to reinvent GOT. They call it TOC (table of contents).
extern int var0;
addis 3, 2, .LC0@toc@ha
While .o files do not reference .got directly, the TOC scheme makes .toc explicit in .o files. Therefore the TOC layout is under the control of the compiler, and presumably the compiler can leverage better information to optimize the layout for locality. Well, I disagree with this point. The compiler does not know the global information. A linker is better placed to do such link-time optimization.
Let's look at a jump table example.
void puts(const char *);
If A::foo is not optimized out, Clang emits:
.section .text._ZN1A3fooEi,"axG",@progbits,_ZN1A3fooEi,comdat
A .toc entry (not in a group) incorrectly references .rodata._ZN1A3fooEi in a COMDAT group. This violates an ELF specification rule when .rodata._ZN1A3fooEi is non-prevailing and therefore discarded:
A symbol table entry with STB_LOCAL binding that is defined relative to one of a group's sections, and that is contained in a symbol table section that is not part of the group, must be discarded if the group members are discarded. References to this symbol table entry from outside the group are not allowed.
Unfortunately this is difficult to fix. We cannot place .toc in the group. If we do, loading the address of a weak/global symbol in a COMDAT will break similarly.
.text
GNU ld works around the issue by garbage collecting .toc entries. Reliance on garbage collection for correctness is a bad design. For LLD, I simply let LLD ignore a .toc relocation referencing a discarded symbol. D63182
Well, the above can be fixed by changing .LC0 to a hidden/internal visibility STB_GLOBAL symbol, but we will get a useless symbol in .symtab. So PowerPC64 ELFv2's .toc is prettier than ppc32 .got2, but that is the pot calling the kettle black.
Text relocations
In "Runtime computation", I mentioned that GOT is not the only approach allowing addresses dependent on runtime computation. The text relocation technique is another. The name is derived from the fact that dynamic relocations apply to text sections.
Traditionally code and read-only data are placed in the same segment, which is called the text segment. The linker uses the criterion !(sh_flags & SHF_WRITE) to check whether a dynamic relocation is a text relocation. When the output needs text relocations, the linker adds a flag DF_TEXTREL.
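The criterion translates directly into a predicate on section flags. In this sketch the function name and the `SHF_WRITE_FLAG` macro are hypothetical (the macro is spelled differently to avoid clashing with <elf.h>); the value 0x1 is the one the ELF specification assigns to SHF_WRITE.

```c
#include <stdbool.h>
#include <stdint.h>

#define SHF_WRITE_FLAG 0x1u /* SHF_WRITE in the ELF specification */

/* A dynamic relocation against a section counts as a text relocation when
   the section is not writable, whether it holds code or read-only data. */
bool is_text_relocation(uint64_t sh_flags) {
  return !(sh_flags & SHF_WRITE_FLAG);
}
```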
Linker/loader developers often frowned upon text relocations. In https://lore.kernel.org/lkml/CAFP8O3LZ3ZtpkF=RdyDyyXn40oYeDkqgY6NX7YRsBWeVnmPv1A@mail.gmail.com/, I collected some evidence.
Runtime pseudo relocations
On x86, the MinGW runtime supports runtime pseudo relocations, which are conceptually the same as text relocations.
Misc
Myth: Position-dependent code doesn't use GOT.
Not true. To avoid copy relocations and canonical PLT entries, GOT indirection can be used. See -fno-direct-access-external-data in the copy relocations article. That said, the option is not common yet.
There is a way to convert a symbol lookup (the third case in the very beginning) to the first two cases.
Position-dependent code typically uses direct access relocations to reference a symbol. If the symbol is not defined by the executable,
Appendix
On Windows, an undefined symbol is by default similar to a protected visibility symbol on ELF. Direct access is used. __declspec(dllimport) enables __imp_$name, which is like an unconditional GOT entry.
__declspec(dllimport) extern int ext_var;
# x86_64
MinGW invented .rdata$.refptr.var to avoid runtime pseudo relocations, even if a declaration does not specify __declspec(dllimport). This is like an enabled-by-default clang -fno-direct-access-external-data.
movq .refptr.var(%rip), %rax