Why isn't ld.lld faster?
2021-12-19

Recently Rui Ueyama released mold 1.0, and people wonder why, with multi-threading, it is faster than LLD's ELF port (ld.lld). This article makes an in-depth analysis of ld.lld's performance.

First of all, I am very glad that Rui Ueyama started the project. Our world has a plethora of compilers, but not many people learn or write linkers. As its design documentation says, there are many drastically different designs which haven't been explored. In my view, mold is innovative in that it introduced parallel symbol table initialization, symbol resolution, and relocation scanning, which to my knowledge hadn't been implemented before, and showed us amazing results. The innovation gives existing and future linkers an incentive to optimize further.

ld.lld

Previously, people have advocated that "ld.lld focuses on speed". This impression was mainly driven by comparisons with GNU ld and gold. From what I have observed, the focus is probably more on code readability, parity on important features, ideal semantics, portability, and generality, and less on speed. In many cases, less code means better performance. Next, I will describe performance-significant options and do some rudimentary performance analysis.

Performance-significant options

Different programs have different link characteristics. For the same program, we can get vastly different results with different compiler and linker options. So the percentages in this article are very rough and only reflect the few applications I tested. YMMV.

The most significant compiler options for link time are:

  • -O: the input size matters.
  • -g: the debug information size matters.
  • -gz: zlib-compressed debug sections require decompression on the linker side.
  • -ffunction-sections: it is often used together with -fdata-sections. -fno-unique-section-names may help.

The most significant linker options for link time are:

  • (LLD specific) -O0 (-Wl,-O0 in compiler drivers) disables SHF_MERGE duplicate elimination. This can tremendously decrease link time for a program with large .debug_str, but you will get a huge .debug_str as the penalty.
  • -z nocombreloc: don't sort dynamic relocations in .rela.dyn. Sorting is a large bottleneck when there are millions of relocations in a mostly statically linked executable. Well, this is the cost of PIE.
  • --icf={safe,all}: the section-based deduplication requires extensive computation on contents and relocations.
  • --gc-sections
  • --build-id
  • --compress-debug-sections=zlib: compressing output debug sections

In a large executable with no debug information, I measured the following with ld.lld --gc-sections --time-trace:

  • 32% Parse input files
  • 15% Write output
  • 14% Scan relocations
  • 12% --gc-sections
  • 5% Load input files
  • 2% Merge/finalize input sections

In a large executable with debug information, I measured the following:

  • 64% Merge/finalize input sections
  • 18% Write output file
  • 5% Parse input files
  • 5% Split sections

TODO: more examples

You can map these metrics to the generic linker passes.

Generic ELF linker passes

Here are the passes a typical linker has.

  • Parse command line options
  • Find and scan input files (.o, .so, .a), interleaved with symbol resolution
  • Global transforms (section based garbage collection, identical code folding, etc)
  • Create synthetic sections
  • Map input sections and synthetic (linker-generated) sections into output sections
  • Scan relocations
  • Layout and assign addresses (thunks, sections with varying sizes, symbol assignments)
  • Write headers
  • Copy section contents to output and resolve relocations

Now let me do an in-depth analysis of various passes in LLD.

Find and load input files

For each input file, ld.lld calls open and mmap, then categorizes it by scanning the initial bytes (magic). The process is serial and synchronous. In the absence of debug information, this pass may take some time (5%).

lld-link loads files asynchronously, which speeds things up a bit. Parallelizing this pass can improve performance a little, but it may cost a bit of readability.

Parse input and perform symbol resolution

An input file can be a relocatable object file, a shared object, an archive, a lazy object file, an LLVM bitcode file, or a linker script.

Relocatable object files

A relocatable object file contributes symbols, sections, and relocations to the link process. ld.lld reads the sections and creates in-memory representations for them, then reads the symbol table. In a symbol table, all symbols with STB_LOCAL binding precede the weak and global symbols. The local part is file-local, while the non-local part requires resolution with other files.

The non-local part consists of STB_GLOBAL, STB_WEAK, and (GCC specific) STB_GNU_UNIQUE symbols. An entry may be defined (Defined) or undefined (Undefined). ld.lld inserts each entry into the global symbol table and resolves it.

Shared objects

A shared object contributes just symbols to the link process. For a shared object, ld.lld reads its dynamic symbol table (SHT_DYNSYM) which consists of undefined symbols (represented identically to an undefined in a relocatable object file) and defined symbols (represented as a Shared).

Archives

This is an ancient trick to decrease the linker's work: initially every archive member is in a lazy state; if an archive member provides a definition resolving an undefined symbol, it is extracted. Extraction may trigger recursive extraction within the same archive (the default) or across other archives (--start-group).
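Below is a minimal sketch of this rule, assuming hypothetical types (SymbolTable, ArchiveMember, extract) rather than ld.lld's actual classes: a lazy symbol only triggers extraction when it meets an existing Undefined.

```cpp
// A simplified sketch (hypothetical types, not ld.lld's) of the archive rule
// described above: a member stays lazy until it defines a symbol that an
// earlier input left undefined; extraction may cascade.
#include <string>
#include <unordered_map>

struct ArchiveMember;
void extract(ArchiveMember *m); // hypothetical: parses the member as a .o

enum class Kind { Undefined, Defined, Lazy };
struct Symbol { Kind kind; ArchiveMember *lazyMember; };

using SymbolTable = std::unordered_map<std::string, Symbol>;

// Called for each symbol `name` listed in the archive index for member `m`.
void addLazy(SymbolTable &symtab, const std::string &name, ArchiveMember *m) {
  auto [it, inserted] = symtab.try_emplace(name, Symbol{Kind::Lazy, m});
  if (!inserted && it->second.kind == Kind::Undefined) {
    it->second = {Kind::Defined, nullptr}; // replaced by the real definition during extraction
    extract(m); // may add new Undefined symbols, triggering further extraction
  }
}
```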

Symbol resolution

TODO: move some part to "how to implement an ELF linker"

The resolution between two .o defined/undefined symbols is associative. In the absence of weak symbols and duplicate definition errors, the interaction is also commutative.

The resolution between a .o and a .so is commutative and associative.

The resolution between two .so is associative. If two .so don't define the same symbol, the resolution is also commutative.

For an archive symbol, its interaction with any other file kind is neither commutative nor associative.

When linking a large program with little debug information, ld.lld --time-trace shows several symbol resolution passes, each taking from 0.x% to 3% of the time. Anyhow, even iterating over an empty global symbol table takes 0.x% of the time. In the presence of --dynamic-list and --version-script, the upper bound can go higher because wildcard pattern matching can be slow. Some passes implement ideal symbol resolution semantics which are not needed by 99.9% of software in the wild: demoting shared symbols/archive symbols, parsing symbol versions, and redirecting symbols (for the edge-case --wrap and unversioned symbol elimination).

I believe mold handles fewer cases and will take similar performance hits if the tricky cases are handled.

What can be parallelized

"Parse input and perform symbol resolution" is typically the largest bottleneck in a link and can be parallelized in three places.

  • initialization of local symbols (embarrassingly parallel)
  • initialization of non-local symbols
  • symbol resolution

Parallel initialization of local symbols

Initialization of local symbols is embarrassingly parallel. As of 13.0.0, ld.lld called make<Defined> or make<Undefined> for each symbol table entry, where make is backed by the non-thread-safe BumpPtrAllocator. I recently changed the code to batch-allocate SymbolUnion. Is this part ready to be parallelized? Not yet.

To support linker scripts, a defined symbol may be relative to either an output section or an input section. The symbol representation therefore has a pointer member section, which means we need to have the section representations first. To diagnose relocations referencing a local symbol defined in a discarded section (due to COMDAT rules), such a local symbol is represented as Undefined. This requires the section pointer and COMDAT group resolution before we can decide whether a local symbol should be Defined or Undefined.
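A minimal sketch of this dependency, assuming a class hierarchy like ld.lld's (Defined/Undefined with a section pointer) but otherwise hypothetical:

```cpp
// Why local symbol initialization needs section parsing and COMDAT group
// resolution first: a local symbol defined in a discarded section is
// represented as Undefined so later relocations to it can be diagnosed.
#include <cstdint>
#include <elf.h>

struct InputSection; // in-memory representation of an input section

struct Symbol {};
struct Defined : Symbol { InputSection *section; uint64_t value; };
struct Undefined : Symbol {};

// `sec` is nullptr when the defining section was discarded by COMDAT rules
// (or when the symbol is absolute).
Symbol *makeLocal(const Elf64_Sym &sym, InputSection *sec) {
  if (sym.st_shndx != SHN_ABS && sec == nullptr)
    return new Undefined();
  return new Defined{{}, sec, sym.st_value};
}
```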

For relocatable object files, ld.lld initializes sections before symbols. However, for archives and lazy object files, ld.lld does not parse their sections. In a large link, most files are usually archives and lazy object files, so parallelizing just relocatable object files would be of little help.

One idea is to eagerly parse archive and lazy object files, and forget that an archive may have an index. This would make the linker more amenable to parallelism, at a cost to single-threaded performance.

If we do parallel initialization, the section member may need to be fixed up later. The symbol kind may also change from Defined to Undefined after COMDAT group resolution. Note that the symbol kind does not have to be expressed through C++ class inheritance, though ld.lld currently does this and it is difficult to change at this point due to the numerous references.

Parallel initialization of non-local symbols

If both a.o and b.o have a non-local symbol foo, they need to refer to the same entry in the global symbol table. Conceptually an object file needs a data structure like vector<Symbol *> symbols;. In the database world, this is a hash join or sort merge join problem. Every linker implementation uses a hash join.

ld.lld represents the global symbol table with llvm::DenseMap<llvm::CachedHashStringRef, int> symMap; std::vector<Symbol *> symVector;. For each symbol table entry, ld.lld reserves a symbol in the global symbol table if it does not exist yet. If we had a concurrent hash table, the insertion could obviously be parallelized. In April 2021, there was an llvm-dev thread "Concurrent Hashmap?" but we haven't taken any action.

mold uses a concurrent hash table from Intel oneTBB.
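To illustrate what a concurrent hash table buys here, a minimal sketch of parallel non-local symbol insertion using a sharded map guarded by per-shard mutexes; this is a stand-in I made up, not ld.lld's or mold's implementation, and the names (GlobalSymtab, insert) are hypothetical.

```cpp
// Sharded global symbol table: each name hashes to one of numShards shards,
// so concurrent insertions from per-file parsing threads mostly hit
// different locks. Every file ends up with a pointer to the unique Symbol.
#include <mutex>
#include <string>
#include <string_view>
#include <unordered_map>

struct Symbol { std::string name; };

class GlobalSymtab {
  static constexpr size_t numShards = 32;
  struct Shard {
    std::mutex mu;
    std::unordered_map<std::string_view, Symbol *> map;
  };
  Shard shards[numShards];

public:
  // Returns the unique Symbol for `name`, creating it if needed.
  // Safe to call concurrently; `name` must outlive the symbol table
  // (e.g. it points into an mmapped string table).
  Symbol *insert(std::string_view name) {
    Shard &s = shards[std::hash<std::string_view>{}(name) % numShards];
    std::lock_guard<std::mutex> lock(s.mu);
    auto [it, inserted] = s.map.try_emplace(name, nullptr);
    if (inserted)
      it->second = new Symbol{std::string(name)};
    return it->second;
  }
};
```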

Parallel symbol resolution

mold uses parallel symbol resolution. I previously mentioned that the resolution of archive symbols is not associative, so parallelism is difficult. For a.a b.so, in GNU ld, gold, and ld.lld:

  • if the a.a member is extracted, it becomes a regular object definition and overrides the b.so definition.
  • if the a.a member is not extracted, the b.so definition wins.

mold seems to be willing to break this compatibility. I know 99.99% of software doesn't rely on this behavior, but personally I think the 0.01% case is important and reflects a symbol resolution ideal I want to uphold.

mold assigns lower priorities to archive symbols than shared symbols. The b.so definition wins, regardless of the position of a.a.

According to mold --perf, symbol initialization and resolution performance is as follows:

  • 1 thread: 1.3x slower than ld.lld (atomics have costs)
  • 2 threads: 1.75x faster than 1 thread
  • 4 threads: 3x faster than 1 thread
  • 8 threads: 5.3x faster than 1 thread
  • 16 threads: 5.4x faster than 1 thread

Global transforms

--gc-sections

See Linker garbage collection. For a -ffunction-sections -fdata-sections build, GC can take little time (say, 2%) when there is debug information, or a lot (10%) when there is little debug information. .debug_* sections are large and monolithic (there are far fewer of them than .text.* and .data.* sections) and take significant write time, so they dilute the proportion of GC time.

The code is straightforward to parallelize; unfortunately, llvm-project does not provide a parallel producer-consumer framework like oneTBB's [algorithms.parallel_for_each.feeder]. Once such a facility is available, it will not be difficult to parallelize this part.
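A minimal sketch (not lld code) of what that would look like with oneTBB's feeder interface: the loop body can enqueue newly reached sections, which is exactly the producer-consumer pattern liveness marking needs. Section and successors are hypothetical stand-ins for input sections and the sections their relocations reference.

```cpp
#include <atomic>
#include <vector>
#include <tbb/parallel_for_each.h>

struct Section {
  std::atomic<bool> live{false};
  std::vector<Section *> successors; // sections reached via relocations
};

void markLive(std::vector<Section *> &roots) {
  for (Section *sec : roots)
    sec->live.store(true, std::memory_order_relaxed);
  tbb::parallel_for_each(
      roots.begin(), roots.end(),
      [](Section *sec, tbb::feeder<Section *> &feeder) {
        for (Section *succ : sec->successors)
          // Only the thread that flips `live` from false to true enqueues
          // the successor, so each section is processed exactly once.
          if (!succ->live.exchange(true))
            feeder.add(succ);
      });
}
```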

--icf

If enabled, this pass may take 3~15% of the link time. It is more expensive than the current serial --gc-sections.

That said, ld.lld's algorithm is pretty good. It computes a lightweight message digest for each section, then collects sections with equal digests into buckets. Within each bucket it does pairwise comparisons, which are easy to reason about, and the expensive computation can be done lazily.

Another idea is to use message digests that fully describe a section (mold). Such a digest can be more expensive to compute, since the pairwise comparison approach can bail out early and skip much of the work.

My test shows that mold's algorithm is sometimes slower, for example when linking Clang.

Scan relocations

An input section may have a dependent SHT_REL/SHT_RELA section. ld.lld scans the relocation records to decide what action to take for each (a simplified sketch follows the list):

  • link-time constants: the location should be updated in-place
  • GOT: .got and .rela.dyn need an update
  • PLT: .plt, .got.plt, and .rela.plt need an update
  • copy relocations, canonical PLT entries
  • TLS dynamic relocations
  • report a diagnostic
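The sketch below is highly simplified and uses hypothetical enums and flags (Action, isPreemptible, referencesGot, and so on); it covers far fewer cases than ld.lld, but shows the shape of the per-relocation decision.

```cpp
enum class Action { InPlaceConstant, NeedsGot, NeedsPlt, DynamicReloc };

struct Symbol {
  bool isPreemptible; // may be interposed at run time (e.g. default visibility in a DSO)
};

// `isPic` stands in for -pie/-shared; `isAbsoluteReloc` for relocation types
// like R_X86_64_64 that encode an absolute address. Diagnostics, TLS, copy
// relocations and canonical PLT entries are all omitted.
Action scanOne(const Symbol &sym, bool referencesGot, bool isCall,
               bool isAbsoluteReloc, bool isPic) {
  if (referencesGot)
    return Action::NeedsGot;       // .got, possibly also .rela.dyn
  if (isCall && sym.isPreemptible)
    return Action::NeedsPlt;       // .plt, .got.plt, .rela.plt
  if (sym.isPreemptible || (isPic && isAbsoluteReloc))
    return Action::DynamicReloc;   // emitted into .rela.dyn
  return Action::InPlaceConstant;  // resolved at link time, patched in place
}
```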

For single-threading, mold's relocation scan is 1.7~2.3x as fast as ld.lld's. I think this is fundamentally because ld.lld does more work. It has more dispatches and abstraction costs (e.g. virtual functions), provides more diagnostics, and handles many tricky cases that mold doesn't:

  • support all of REL/RELA/mips64el for one architecture (even if the architecture typically doesn't use REL)
  • use a generic relocation expression (RelExpr) design which costs two virtual functions processing one relocation
  • handle non-preemptible GNU indirect function
  • retain otherwise unused .got and .got.plt for certain relocation types
  • Hexagon/Mips/PowerPC hacks
  • Unfortunate Fortran COMMON symbol semantics (only needed by PowerPC64 AFAIK)

Every check has a small cost. Many a little makes a mickle.

I have a patch refactoring the way ld.lld does relocations, but it may increase the lines of code a lot: https://reviews.llvm.org/D113228. I do not know whether it is the right trade-off.

Another thing worth pointing out is that this pass saves some relocations (InputSectionBase::relocations) which are needed by subsequent passes such as range extension thunks. Many input relocations are currently passed through unchanged, so we pay the push_back time and memory. If we speculatively assume that most relocations will be passed through, we could omit InputSectionBase::relocations, but then subsequent passes would have to consume input that is more difficult to deal with.

Aside from the constraints mentioned above, relocation scanning is conceptually straightforward to parallelize. Local symbols need no communication, while non-local symbols just need atomic states.

My https://reviews.llvm.org/D114783 is an important refactoring of ld.lld's relocation scan. To enable parallelism, there is still much to do. We would likely have to exclude some architectures needing special treatment (Mips/PowerPC). With the other constraints, it may or may not be a good trade-off.

For a random large application, I measured the following for mold:

  • 1 thread: 2.3x faster than ld.lld
  • 2 threads: 1.88x faster than 1 thread
  • 4 threads: 3.28x faster than 1 thread
  • 16 threads: 11.13x faster than 1 thread

On the other hand, with little debug information, relocation scanning takes 15% of the total time. IMO parallelism does not pay off unless we optimize the other critical paths so that this 15% stands out. I raised this point during the 2021 LLVM Dev Meeting round table "LLD for Mach-O".

Map input sections and synthetic sections into output sections

To support linker scripts, ld.lld's model was designed around output section descriptions and input section descriptions. A typical scenario where a user complains about link performance does not involve a linker script. Nevertheless, the mapping process takes a slight performance hit because we reuse the linker script framework even in the absence of a SECTIONS command. For a large executable with little debug information, "Assign sections" took 2% of the time.

If we want to optimize this case, we will need to introduce a separate code path which will lead to implementation complexity.

SHF_MERGE duplicate elimination

A section with the SHF_MERGE|SHF_STRINGS flags has data elements consisting of NUL-terminated strings. In a program with debug information, merging .debug_str (which has the SHF_MERGE|SHF_STRINGS flags) may be the hottest pass. A section with just the SHF_MERGE flag has data elements of a uniform size. In either case, duplicate elements can be eliminated as an optional optimization. ld.lld performs the optimization at (the default) -O1 and above.

ld.lld splits the section contents into pieces, collects pieces from all sections, then eliminates duplicates with a hash table. An obvious idea is to use a concurrent hash table. As said previously, llvm-project does not provide one. ld.lld duplicates some work and does the following instead:

```cpp
// Simplified from lld's MergeNoTailSection::finalizeContents. Pieces are
// distributed into numShards (32) shards by hash; each thread deduplicates
// only the shards assigned to it, so no locking is needed.
size_t concurrency = PowerOf2Floor(std::min<size_t>(numThreads, numShards));

// Step 1: deduplicate in parallel. A thread only touches shards where
// (shardId & (concurrency - 1)) == threadId, so every piece is visited by
// up to `concurrency` threads but added by exactly one.
parallelForEachN(0, concurrency, [&](size_t threadId) {
  for (MergeInputSection *sec : sections)
    for (auto &piece : sec->pieces) {
      if (!piece.live)
        continue;
      size_t shardId = getShardId(piece.hash);
      if ((shardId & (concurrency - 1)) == threadId)
        piece.outputOff = shards[shardId].add(piece.data);
    }
});

// Step 2: compute the starting offset of each shard within the output section.
size_t off = 0;
for (size_t i = 0; i < numShards; ++i) {
  shards[i].finalizeInOrder();
  if (shards[i].getSize() > 0)
    off = alignTo(off, alignment);
  shardOffsets[i] = off;
  off += shards[i].getSize();
}
size = off;

// Step 3: add the shard base to each piece's intra-shard offset.
parallelForEach(sections, [&](MergeInputSection *sec) {
  for (auto &piece : sec->pieces)
    if (piece.live)
      piece.outputOff += shardOffsets[getShardId(piece.hash)];
});
```

It would be nice if we could explore alternative algorithms.

Open output file

For -o out, ld.lld unlinks out asynchronously, creates out in read-write mode, and resizes out to the output size. The output size is computed in the writer, so this has to wait until output section addresses are assigned.

I recently noticed that the way LLVM's Support library resizes the file is problematic for Linux tmpfs. The API calls posix_fallocate, which has the nice property of reporting ENOSPC up front on some filesystems and operating systems, but it may be bad for tmpfs. See https://reviews.llvm.org/D115957. On my system, posix_fallocate on a 1.4GiB output costs 0.3s, which is significant.
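A minimal sketch (not llvm's Support code; createOutput is a name I made up) contrasting the two ways of growing the output file:

```cpp
// posix_fallocate reports ENOSPC eagerly on some filesystems, but on tmpfs
// it forces every page to be instantiated up front; ftruncate merely
// extends the file size, leaving allocation to page faults during writing.
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int createOutput(const char *path, off_t size, bool eagerAllocate) {
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
  if (fd < 0)
    return -1;
  int err = eagerAllocate
                ? posix_fallocate(fd, 0, size)          // may cost ~0.3s for 1.4GiB on tmpfs
                : (ftruncate(fd, size) < 0 ? errno : 0); // cheap, no eager allocation
  if (err != 0) {
    close(fd);
    return -1;
  }
  return fd;
}
```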

Write sections

Some output sections contain just a single synthetic section. I plan to analyze some expensive sections more carefully. For my experiments this week, I used the additional time tracing (https://reviews.llvm.org/D115984) a lot.

I found that .rela.dyn sorting (-z combreloc) took 10+% of the time for a position-independent executable with millions of relocations. I pushed a trivial parallelization commit and have a pending patch to optimize it further: https://reviews.llvm.org/D115993 (a rough sketch of the sort follows the list below). This is a great example demonstrating that generality and feature creep bring us constraints and abstraction costs:

  • outputSec is in the dynamic relocation representation because of Mips.
  • We want to support all of REL/RELA/mips64el. We value code brevity, so manual loop unswitching would not be OK.
  • We support LLD's partition feature, so obtaining the symbol index has an extra condition which adds a little cost.
  • We support Android packed relocations which have very different encoding schemes, so refactoring encodeDynamicReloc needs to touch it.
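As a rough sketch (not the actual patch), the -z combreloc ordering puts relative relocations first, then sorts by symbol index and offset to improve the dynamic loader's locality; llvm::parallelSort does the heavy lifting. DynamicReloc here is a hypothetical flattened record, unlike lld's real representation with its Mips/partition/packed-relocation baggage.

```cpp
#include <cstdint>
#include <elf.h>
#include <tuple>
#include <vector>
#include "llvm/Support/Parallel.h"

struct DynamicReloc {
  uint32_t type;
  uint32_t symIndex;
  uint64_t offset;
  bool isRelative() const { return type == R_X86_64_RELATIVE; } // x86-64 only, for brevity
};

void sortRelaDyn(std::vector<DynamicReloc> &relocs) {
  llvm::parallelSort(relocs.begin(), relocs.end(),
                     [](const DynamicReloc &a, const DynamicReloc &b) {
                       // Relative relocations first, then by symbol index, then offset.
                       return std::make_tuple(!a.isRelative(), a.symIndex, a.offset) <
                              std::make_tuple(!b.isRelative(), b.symIndex, b.offset);
                     });
}
```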

--compress-debug-sections=zlib

TODO

The format is restricted to zlib. llvm-project does not provide a parallel zlib implementation.

--build-id

According to https://fedoraproject.org/w/index.php?title=RolandMcGrath/BuildID&oldid=16098, the feature was added as an "approximation of true uniqueness across all binaries that might be used by overlapping sets of people". As far as I know, ld.lld's --build-id feature has no validation tool. ld.lld supports several hash functions available in llvm-project and uses a parallel tree-hash algorithm. Some of the library code is not optimal.
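A rough sketch of the tree-hash idea (not lld's exact code; treeHash and the 1MiB chunk size are illustrative): hash fixed-size chunks of the output in parallel, then hash the concatenation of the chunk digests. The result is deterministic regardless of the thread count, though it differs from a plain SHA-1 of the whole file.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/Parallel.h"
#include "llvm/Support/SHA1.h"

std::array<uint8_t, 20> treeHash(llvm::ArrayRef<uint8_t> data) {
  const size_t chunkSize = 1024 * 1024;
  size_t numChunks = (data.size() + chunkSize - 1) / chunkSize;
  std::vector<std::array<uint8_t, 20>> digests(numChunks);
  // Hash each chunk independently; this is the parallel part.
  llvm::parallelForEachN(0, numChunks, [&](size_t i) {
    size_t n = std::min(chunkSize, data.size() - i * chunkSize);
    digests[i] = llvm::SHA1::hash(data.slice(i * chunkSize, n));
  });
  // Hash of hashes: depends only on the content and the chunk size.
  return llvm::SHA1::hash(llvm::ArrayRef<uint8_t>(
      reinterpret_cast<const uint8_t *>(digests.data()), digests.size() * 20));
}
```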

Executable size

TODO

New hope?

With the patches (landed and pending) I have made this week, ld.lld --threads={2,4,8} writing its output to tmpfs seems to be ~11% faster than 13.0.0, for both a clang RelWithDebInfo link and a large app with no debug info.

Epilogue

With multi-threading, why is mold faster than ld.lld? Ordered by my rough estimate of importance:

  • SHF_MERGE duplicate elimination
  • parallel symbol table initialization
  • eager loading of archive members and lazy object files
  • parallel relocation scan
  • faster synthetic sections
  • parallel --gc-sections
  • parallel symbol resolution

ld.lld has some parallelism, mostly in writing sections and a few synthetic sections. mold brings more parallelizable passes to the table. This is good.

For ld.lld, the lack of concurrent data structures and parallel facilities in llvm-project reduces the design space we can explore. This can be fixed.

Features and generality require abstraction. Abstraction has costs. The costs can be tangible (performance) and intangible (reduced design space). For symbol table initialization, symbol resolution, and relocation scanning, I believe ld.lld has found a good balance among different trade-offs. These passes have been ruthlessly challenged by mold. However, with various constraints, they may be difficult to change. Perhaps the designers of archive semantics and COMDAT had not anticipated that, decades later, linker developers would struggle to find a good balance and compromise :) Perhaps 99% of people will not feel sympathy for the serial choices ld.lld makes.

