A compact relocation format for ELF

ELF's design emphasizes natural size and alignment guidelines for its control structures. This principle, outlined in Proceedings of the Summer 1990 USENIX Conference, ELF: An Object File to Mitigate Mischievous Misoneism, promotes ease of random access for structures like program headers, section headers, and symbols.

All data structures that the object file format defines follow the "natural" size and alignment guidelines for the relevant class. If necessary, data structures contain explicit padding to ensure 4-byte alignment for 4-byte objects, to force structure sizes to a multiple of four, etc. Data also have suitable alignment from the beginning of the file. Thus, for example, a structure containing an Elf32_Addr member will be aligned on a 4-byte boundary within the file. Other classes would have appropriately scaled definitions. To illustrate, the 64-bit class would define Elf64 Addr as an 8-byte object, aligned on an 8-byte boundary. Following the strictest alignment for each object allows the format to work on any machine in a class. That is, all ELF structures on all 32-bit machines have congruent templates. For portability, ELF uses neither bit-fields nor floating-point values, because their representations vary, even among pro- cessors with the same byte order. Of course the pro- grams in an ELF file may use these types, but the format itself does not.

While beneficial for many control structures, the natural size guideline presents significant drawbacks for relocations. Since relocations are typically processed sequentially, they don't gain the same random-access advantages. The large 24-byte Elf64_Rela structure highlights the drawback. For a detailed comparison of relocation formats, see Exploring object file formats#Relocations.

Furthermore, Elf32_Rel and Elf32_Rela sacrifice flexibility to maintain a smaller size, limiting relocation types to a maximum of 255. This constraint has become noticeable for AArch32 and RISC-V. While the 24-bit symbol index field is less elegant, it hasn't posed significant issues in real-world use cases.

In contrast, the WebAssembly object file format uses LEB128 encoding for relocations and other constrol structures, offering a significant size advantage over ELF.

Inspired by WebAssembly, I will explore real-world scenarios where relocation size is critical and propose an alternative format (RELLEB) that addresses ELF's limitations.

Use cases

Dynamic relocations

A substantial part of position-independent executables (PIEs) and dynamic shared objects (DSOs) is occupied by dynamic relocations. While RELR (a compact relative relocation format) offers size-saving benefits for relative relocations, other dynamic relocations can benefit from a compact relocation format.

ld.lld --pack-dyn-relocs=android was an earlier design that applies to all dynamic relocations at the cost of complexity.

Additionally, Apple linkers and dyld use LEB128 encoding for bind opcodes.

Marker relocations

Marker relocations are utilized to indicate certain linker optimization/relaxation is applicable. While many marker relocations are used scarcely, RISC-V relocatable files are typically filled up with R_RISCV_RELAX relocations. Their size contribution is quite substantial.

`.llvm_addrsig`

On many Linux targets, Clang emits a special section called .llvm_addrsig (type SHT_LLVM_ADDRSIG, LLVM address-significance table) by default to allow ld.lld --icf=safe. The .llvm_addrsig section stores symbol indexes in ULEB128 format, independent of relocations. Consequently, tools like ld -r and objcopy risk invalidate the section due to symbol table modifications.

Ideally, using relocations would allow certain operations. However, the size concern of REL/RELA in ELF hinders this approach. In contrast, lld's Mach-O port chose a relocation-based representation for __DATA,__llvm_addrsig.

`.llvm.call-graph-profile`

LLVM leverages a special section called .llvm.call-graph-profile (type SHT_LLVM_CALL_GRAPH_PROFILE) for both instrumentation- and sample-based profile-guided optimization (PGO). lld utilizes this information ((from_symbol, to_symbol, weight) tuples) to optimize section ordering within an input section description, enhancing cache utilization and minimizing TLB thrashing.

Similar to .llvm_addrsig, the .llvm.call-graph-profile section initially faced the symbol index invalidation problem, which was solved by switching to relocations. I opted for REL over RELA to reduce code size.

`.debug_names`

DWARF v5 accelerated name-based access with the introduction of the .debug_names section. However, in a clang -g -gpubnames generated relocatable file, the .rela.debug_names section can consume a significant portion (approximately 10%) of the file size. This size increase has sparked discussions within the LLVM community about potentially altering the file format for linking purposes.

The availability of a more compact relocation format would likely alleviate the need for such format changes.

Compressed relocations

While the standard SHF_COMPRESSED feature is commonly used for debug sections, its application can easily extend to relocation sections. I have developed a Clang/lld prototype that demonstrates this by compressing SHT_RELA sections.

The compressed SHT_RELA section occupies sizeof(Elf64_Chdr) + size(compressed) bytes. The implementation retains uncompressed content if compression would result in a larger size.

In scenarios with numerous smaller relocation sections (such as when using -ffunction-sections -fdata-sections), the 24-byte Elf64_Chdr header can introduce significant overhead. This observation raises the question of whether encoding Elf64_Chdr fields using ULEB128 could further optimize file sizes. With larger monolithic sections (.text, .data, .eh_frame), compression ratio would be higher as well.


configure-llvm s2-custom0 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld'
configure-llvm s2-custom1 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS=-Xclang=--compress-relocations=zstd
ninja -C /tmp/out/s2-custom0 lld
ninja -C /tmp/out/s2-custom1 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom0/**/*.o").sum{|f| File.size(f)}'  
ruby -e 'p Dir.glob("/tmp/out/s2-custom1/**/*.o").sum{|f| File.size(f)}'

Despite the overhead of -ffunction-sections -fdata-sections, the compression technique yields a significant reduction of 14.5%!

However, dropping in-place relocation processing is a downside.

RELLEB relocation format

The 1990 ELF paper ELF: An Object File to Mitigate Mischievous Misoneism says "ELF allows extension and redefinition for other control structures." Inspired by WebAssembly, let's explore RELLEB, a new and more compact relocation format designed to replace RELA. Our emphasis is on simplicity over absolute minimal encoding. See the end of the article for a detailed format description.

A SHT_RELLEB section (preferred name: .relleb<name>) holds compact relocation entries that decode to Elf32_Rela or Elf64_Rela depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_type, r_symidx, and r_addend.

Here are key design choices:

Relocation count (ULEB128):

This allows for efficient retrieval of the relocation count without decoding the entire section. While a uint32_t (like SHT_HASH) could be used, ULEB128 aligns with subsequent entries, removes endianness differences, and offers a slight size advantage in most cases when the number of symbols can be encoded in one to three bytes.

Delta encoding for r_offset (ULEB128):

Section offsets can be large, and relocations are typically ordered. Storing the difference between consecutive offsets offers compression potential. In most cases, a single byte will suffice. While there are exceptions (general dynamic TLS model of s390/s390x uses a local "out-of-order" pair: R_390_PLT32DBL(offset=o) R_390_TLS_GDCALL(offset=o-2)), we are optimizing for the common case. Switching to SLEB128 would increase the total .o size by 0.1%.

For ELFCLASS32, r_offsets members are calculated using modular arithmetic modulo 4294967296.

Delta encoding for r_type (SLEB128):

Some psABIs utilize relocation types greater than 128. AArch64's static relocation types begin at 257 and dynamic relocation types begin at 1024, necessitating two bytes with ULEB128/SLEB128 encoding in the absence of delta encoding. Delta encoding allows all but the first relocation's type to be encoded in a single byte. An alternative design is to define a base type in the header and encode types relative to the base type, which would introduce slight complexity.

If the AArch32 psABI could be redesigned, allocating [0,64) for Thumb relocation types and [64,*) for ARM relocation types would optimize delta encoding even further.

While sharing a single type code for multiple relocations would be efficient, it would require reordering relocations. This conflicts with order requirements imposed by several psABIs and could complicate linker implementations.

Symbol index and addend presence (SLEB128):

ULEB128 optimizes for the common case when the symbol index is encodable in one or two bytes. Using SLEB128 and delta encoding instead of ULEB128 for the symbol index field would increase the total size by 0.4%. Potential gains for dynamic relocations with consecutive indexes are outweighed by the loss in static relocations and added complexity, hence avoiding delta encoding. We indicate addend omission using the sign bit (see below).

Delta encoding for addend (SLEB128):

Consecutive static relocations often have identical addends, especially on architectures with frequent zero addends (AArch64, PowerPC, RISC-V, etc). Addend omission optimizes this scenario. Additionally, it benefits architectures like AArch32 and x86, which often have identical negative addends (call instructions) for consecutive relocations.

1 2	void f() { g(); g(); g(); ... }

Dynamic relocations (except R_*_RELATIVE and R_*_IRELATIVE) typically have zero addends, also benefiting from our scheme.

The sign bit of the previous member flags whether the addend has changed relative to the prior entry. When the addend changes, we use an addend delta. This offers a slight size advantage (about 0.20%) and optimizes for cases like:

.quad .data + 0x78
.quad .data + 0x80
.quad .data + 0x88
...

Note: when the bitwise NOT code path is taken, the zero delta addend is not utilized.

There is no RELLEB replacement for .rela.plt. In glibc, there is complexity due to __rela_iplt_start.

I have developed a prototype at https://github.com/MaskRay/llvm-project/tree/demo-relleb. RELLEB demonstrates superrior size reduction compared to the SHF_COMPRESSED SHT_RELA approach.

configure-llvm s2-custom2 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS=-mrelleb

ninja -C /tmp/out/s2-custom2 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom2/**/*.o").sum{|f| File.size(f)}'

RELLEB yields a significant reduction of 20.5%! The total relocation section size has decreased from 28768176 to 5318839, 18.4% or the original.

It would be interesting to explore the potential gains of combining zstd compression with RELLEB.

configure-llvm s2-custom3 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS='-mrelleb -Xclang --compress-relocations=zstd'
ninja -C /tmp/out/s2-custom3 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom3/**/*.o").sum{|f| File.size(f)}'

While the 26.4% reduction in RELLEB section size suggests room for further optimization, the overall decrease of only 1.088% in .o file sizes indicates that the current compact relocation format offers a reasonable compromise. (In the absence of the addend presence and delta addend technique, the overall decrease is about 1.5%.)

I debated whether to name the new section SHT_RELOC (.reloc<name>) or SHT_RELLEB (.relleb<name>). Ultimately, I chose SHT_RELLEB because its unique nature minimizes potential confusion, whereas SHT_RELOC could be confused with SHT_REL and SHT_RELA.

RELLEB for dynamic relocations

RELLEB excels with static relocations, but what about the dynamic case? A truly optimal dynamic relocation format would differ substantially. Since dynamic relocations are often adjacent in offsets by 4 or 8 and include fewer types, a generalized RELR format would be ideal. Here's a possible encoding:

// For R_*_RELATIVE
Encode(RELR bytes)
Encode(R_*_RELATIVE)
RELR

// For R_*_GLOB_DAT or the symbolic relocation type
Encode(R_*_GLOB_DAT)
Use RELR to encode offsets
Encode symbol indexes separately

// For R_*_JUMP_SLOT
Encode(R_*_JUMP_SLOT)
Use RELR to encode offsets
Encode symbol indexes separately

While RELLEB is a step up from REL/RELA for dynamic relocations, it's not perfect. Android's packed relocation format, despite its complexity and lack of bitmap encoding, provides greater space savings.

I've implemented -z relleb in my prototype for testing. We need just one dynamic tag: DT_RELLEB. DT_RELLEBSZ is not needed, because the relocation count can be inferred from the header.

Let's link clang-16-debug using several methods and compare: RELA, --pack-dyn-relocs=relr, --pack-dyn-relocs=android+relr, and --pack-dyn-relocs=relr -z relleb.

% llvm-readelf -S clang | grep ' \.rel.*\.'
  [ 8] .rela.dyn         RELA            0000000000f28998 f28998 e177d0 18   A  3   0  8
  [ 9] .rela.plt         RELA            0000000001d40168 1d40168 0025b0 18  AI  3  26  8
% llvm-readelf -S clang.relr | grep ' \.rel.*\.'
  [ 8] .rela.dyn         RELA            0000000000f28998 f28998 016e90 18   A  3   0  8
  [ 9] .relr.dyn         RELR            0000000000f3f828 f3f828 02ccd0 08   A  0   0  8
  [10] .rela.plt         RELA            0000000000f6c4f8 f6c4f8 0025b0 18  AI  3  27  8
% llvm-readelf -S clang.relr+android | grep ' \.rel.*\.'
  [ 8] .rela.dyn         ANDROID_RELA    0000000000f28998 f28998 001707 01   A  3   0  8
  [ 9] .relr.dyn         RELR            0000000000f2a0a0 f2a0a0 02ccd0 08   A  0   0  8
  [10] .rela.plt         RELA            0000000000f56d70 f56d70 0025b0 18  AI  3  27  8
% llvm-readelf -S clang.relr+relleb | grep ' \.rel.*\.'
  [ 8] .relleb.dyn       0x14: <unknown> 0000000000f28998 f28998 003bcd 01   A  3   0  8
  [ 9] .relr.dyn         RELR            0000000000f2c568 f2c568 02ccd0 08   A  0   0  8
  [10] .rela.plt         RELA            0000000000f59238 f59238 0025b0 18  AI  3  27  8

Analysis

RELR significantly optimizes relative relocations, offering the largest size reduction.
RELLEB further improves the non-relative portion, achieving a decent 16.3% reduction. However, it's overshadowed by Android packed relocations (6.3%).
While Android packed relocations have a smaller footprint, their advantage is less pronounced since .relr.dyn still accounts for a significant portion of the size.

RELLEB proposal for the generic ABI

In https://www.sco.com/developers/gabi/latest/ch4.sheader.html, make the following changes.

In Figure 4-9: Section Types,sh_type, append a row

SHT_RELLEB | 20

Add text:

SHT_RELLEB - The section holds compact relocation entries with explicit addends. An object file may have multiple relocation sections. ''Relocation'' below for details.

In Figure 4-16: Special Sections, append a row

.rellebname | SHT_RELLEB | see below

Change the text below:

.relname, .relaname, and .rellebname

These sections hold relocation information, as described in ''Relocation''. If the file has a loadable segment that includes relocation, the sections' attributes will include the SHF_ALLOC bit; otherwise, that bit will be off. Conventionally, name is supplied by the section to which the relocations apply. Thus a relocation section for .text normally would have the name .rel.text, .rela.text, or .relleb.text.

In Figure 4-23: Relocation Entries, add:

typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_type;
  Elf32_Word r_symidx;
  Elf32_Sxword r_addend;
} Elf32_Relleb;

typedef struct {
  Elf64_Addr r_offset;
  Elf64_Word r_type;
  Elf64_Word r_symidx;
  Elf64_Sxword r_addend;
} Elf64_Relleb;

Add text above "A relocation section references two other sections":

A SHT_RELLEB section holds compact relocation entries that decode to Elf32_Relr or Elf64_Relr depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_type, r_symidx, and r_addend. Note that the r_info member in traditional REL/RELA formats has been split into separate r_type and r_symidx members, allowing uint32_t relocation types for ELFCLASS32 as well.

Delta offset: (ULEB128-encoded) The difference in r_offset relative to the previous entry. The difference is truncated to 32-bit/64-bit for ELFCLASS32/ELFCLASS64, respectively.
Delta type: (SLEB128-encoded) The difference in relocation type relative to the previous entry.
Symbol table index and addend presence: (SLEB128-encoded) If the r_offset is equal to the previous r_addend, the encoded value represents the symbol index; otherwise, the bitwise NOT of the encoded value indicates the symbol index.
Delta addend: (SLEB128-encoded, only if indicated in the previous member) The difference in r_addend relative to the previous entry. The difference is truncated to 32-bit/64-bit for ELFCLASS32/ELFCLASS64, respectively.

Example C++ encoder for Elf64_Relleb:

uint64_t offset = 0, addend = 0;
uint32_t type = 0;
encodeULEB128(relocs.size(), os);
for (const Reloc &rel : relocs) {
  encodeULEB128(rel.offset - offset, os);
  encodeSLEB128(static_cast<int32_t>(rel.type - type), os);
  type = rel.type;
  if (addend == rel.addend) {
    encodeSLEB128(rel.symidx, os);
  } else {
    encodeSLEB128(~symidx, os);
    encodeSLEB128(rel.addend - addend, os);
    addend = rel.addend;
  }
}

For the first relocation entry, the previous offset, type, and addend members are treated as zero.

In Figure 5-10: Dynamic Array Tags, d_tag, add:

DT_RELLEB | 38 | d_ptr | optional | optional

Add text below:

DT_RELLEB - This element is similar to DT_RELA, except its table uses the RELLEB format.