Explaining thread local storage

Thread local storage provides a mechanism allocating separate objects for different threads. It is the usual implementation for C11 _Thread_local and C++11 thread_local on ELF platforms.

An example is POSIX errno:

Each thread has its own thread ID, scheduling priority and policy, errno value, floating point environment, thread-specific key/value bindings, and the required system resources to support a flow of control.

Different threads have different errno copies. errno is typically defined as a function which returns a thread local variable.

Representation

Assembler behavior

The compiler usually defines thread-local variables in .tdata and .tbss sections (which have the section flag SHF_TLS). The variables have type STT_TYPE. In GNU as syntax, you can give a the type STT_TLS with .type a, @tls_object.

.section .tbss,"awT",@nobits
.globl a
.type a, @tls_object
a:
  .zero 4
  .size a, .-a

In Clang and GCC produced assembly, thread-local variables are annotated as .type a, @object (STT_OBJECT). When the assembler sees that such symbols are defined in SHF_TLS sections or referenced by TLS specific relocations, STT_OBJECT will be upgraded to STT_TLS.

GNU as supports an obscure directive .tls_common which defines STT_TLS SHN_COMMON symbols. LLVM integrated assembler does not support .tls_common.

Linker behavior

The linker collects SHF_TLS output sections and place them into a PT_TLS program header.

GNU ld treats STT_TLS SHN_COMMON symbols as defined in .tcommon sections. The internal linker script places such sections into the output section .tdata. LLD does not support STT_TLS SHN_COMMON symbols.

Dynamic loader behavior

The dynamic loader collects PT_TLS program headers from the main executable and immediately loaded shared objects (via transitive DT_NEEDED), and allocates static TLS blocks, one block for each PT_TLS.

Models

Local exec TLS model (executable & non-preemptible)

This is the most efficient TLS model. It applies when the TLS symbol is defined in the executable.

The compiler picks this model in -fno-pic/-fpie modes if the variable is

a definition
or a declaration with a non-default visibility.

The first condition is obvious. The second condition is becuase a non-default visibility means the variable must be defined by another translation unit in the executable.

1
2
3

_Thread_local int def;
__attribute__((visibility("hidden"))) extern thread_local int ref;
int foo() { return def + ref; }

1 2	# x86-64 movl %fs:def@TPOFF, %eax

The offset from the thread pointer to the start of a static block is fixed at program start. The offset from the thread pointer to a TLS symbol is usually referred to as the TP offset. You can usually find "TPREL" or "TPOFF" in the relocation type name.

In addition, for the static TLS block of the main executable, the TP offset of a TLS symbol is a link-time constant. The linker and the dynamic loader share the same formula.

Initial exec TLS model (executable & preemptible)

This model is less efficient than local exec. It applies when the TLS symbol is defined in the executable or a shared object available at program start. The shared object can be due to DT_NEEDED or LD_PRELOAD.

The compiler picks this model in -fno-pic/-fpie modes if the variable is a declaration with default visibility. The idea is that a symbol referenced by the executable must be defined by an immediately loaded shared object, instead of a dlopen loaded shared object. The linker enforces this as well by defaulting to -z defs for a -no-pie/-pie link.

1 2	extern thread_local int ref; int foo() { return ref; }

1
2
3

# x86-64
movq ref@GOTTPOFF(%rip), %rax
movl %fs:(%rax), %eax

Because the offset from the thread pointer to the start of a static block is fixed at program start, such an offset can be encoded by a GOT relocation. Such relocation types typically have GOT and TPREL/TPOFF in their names.

If you add the __attribute((tls_model("initial-exec"))) attribute, a thread-local variable can use this model in -fpic mode. If the object file is linked into an executable, everything is fine. If the object file is linked into a shared object, the shared object generally needs to be an immediately loaded shared object. The linker sets the DF_STATIC_TLS flag to annotate a shared object with initial-exec TLS relocations.

glibc ld.so reserves some space in static TLS blocks and allows dlopen on such a shared object if its TLS size is small. There could be an obscure reason for using such an attribute: general dynamic and local dynamic TLS models are not async-signal-safe in glibc. However, other libc implementations may not reserve additional TLS space for dlopen'ed initial-exec shared objects, e.g. musl will error.

General dynamic and local dynamic TLS models (DSO)

The two modes are used when the TLS symbol may be defined by a shared object. They do not assume the TLS symbol is backed by a static TLS block. Instead, they assume that the thread-local storage of the module may be dynamically allocated, making the models suitable for dlopen usage. The dynamically allocated TLS storage is usually referred to as dynamic TLS.

Each TLS symbol is assigned a pair of (module ID, offset from dtv[m] to the symbol), which is usually referred to as a tls_index object. The module ID m is assigned by the dynamic loader when the module (the executable or a shared object) is loaded, so it is unknown at link time. dtv means the dynamic thread vector. Each thread has its own dynamic thread vector, which is a mapping from module ID to thread-local storage. dtv[m] points to the storage allocated for the module with the ID m.

In the simplest form, once we have a pointer to the (module ID, offset from dtv[m] to the symbol) pair, we can get the address of the symbol with the following C program:


void *__tls_get_addr(size_t *v) {
  pthread_t self = __pthread_self();
  return (void *)(self->dtv[v[0]] + v[1]);
}

General dynamic TLS model (DSO & non-preemptible)

The general dynamic TLS model is the most flexible model. It assumes neither the module ID nor the offset from dtv[m] to the symbol is known at link time. The model is used in -fpic mode when the local dynamic TLS model does not apply. The compiler emits code to set up a pointer to the TLSGD entry of the symbol, then arranges for a call to __tls_get_addr. The return value will contain the runtime address of the TLS symbol in the current thread. On x86-64, you will notice that the leaq instruction has a data16 prefix and the call instruction has two data16 (0x66) prefixes and one rex64 prefix. This is a deliberate choice to make the total size of leaq+call to be 16, suitable for link-time relaxation.

data16 leaq def@tlsgd(%rip), %rdi
.value 0x6666
rex64 call __tls_get_addr@PLT
movl (%rax), %eax

At the linker stage, if the TLS symbol does not satisfy relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSGD relocation, relocated by two dynamic relocations. The dynamic loader will write the module ID to the first word and the offset from dtv[m] to the symbol to the second word. The relocation types are:

arm: R_ARM_TLS_DTPMOD32 and R_ARM_TLS_DTPOFF32
aarch64: R_AARCH64_TLS_DTPMOD and R_AARCH64_TLS_DTPREL (rarely used because TLS descriptors are the default)
i386: R_386_TLS_DTPMOD32 and R_386_TLS_DTPOFF32
x86-64: R_X86_64_DTPMOD64 and R_X86_64_DTPOFF64
mips32: R_MIPS_TLS_DTPMOD32 and R_MIPS_TLS_DTPOFF32
mips64: R_MIPS_TLS_DTPMOD64 and R_MIPS_TLS_DTPOFF64
ppc32: R_PPC_DTPMOD32 and R_X86_64_DTPREL32
ppc64: R_PPC64_DTPMOD64 and R_X86_64_DTPREL64

Local dynamic TLS model (DSO & preemptible)

The local dynamic TLS model assumes that the offset from dtv[m] to the symbol is a link-time constant. This case happens when the TLS symbol is non-preemptible. The compiler emits code to set up a pointer to the TLSLD entry of the module, next arranges for a call to __tls_get_addr, then adds a link-time constant to the return value to get the address. I say "the TLSLD entry of the module" because while (on x86-64) def@tlsld looks like the TLSLD entry of the non-preemptible TLS symbol, it can really be shared by other non-preemptible TLS symbols. So one module needs just one such entry.

1
2
3

leaq def@tlsld(%rip), %rdi
call __tls_get_addr@PLT
movl def@dtpoff(%rax), %edx

At the linker stage, if the TLS symbol does not satisfy local-dynamic to local-exec relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSLD relocation. The dynamic loader will write the module ID to the first word and the offset from dtv[m] to the symbol to the second word.

If the architecture does not define TLS relaxations, the linker can still made an optimization: in -no-pie/-pie modes, set the first word to 1 (main executable) and omit the dynamic relocation for the module ID.

TLS descriptors

Some architectures have define TLS descriptors as more efficient alternatives to the traditional general dynamic and local dynamic TLS models. Such ABIs repurpose the first word of the (module ID, offset from dtv[m] to the symbol) pair to represent a function pointer. The function pointer points to a very simple function in the static TLS case and a function similar to __tls_get_addr in the dynamic TLS case. The caller does an indirection function call instead of calling __tls_get_addr. There are two main points:

The function call to __tls_get_addr uses the regular calling convention: the compiler has to make the pessimistic assumption that all volatile registers may be clobbered by __tls_get_addr.
In glibc, __tls_get_addr is very complex. If the TLS of the module is backed by a static TLS block, the dynamic loader can simply place the TP offset into the second word and let the function pointer point to a function which simply returns the second word.

The first point is the prominent reason that TLS descriptors are generally more efficient. Arguably traditional general dynamic and local dynamic TLS models could have a mechanism to use custom calling convention for __tls_get_addr as well.

In musl, in the static TLS case, the two words will be set to ((size_t)__tlsdesc_static, tpoff) where __tlsdesc_static is a function which returns the second word. glibc's static TLS case is similar.

.globl __tlsdesc_static
.hidden __tlsdesc_static
__tlsdesc_static:
  # The second word stores the TP offset of the TLS symbol.
  movq 8(%rax), %rax
  ret

The scheme optimizes for static TLS but penalizes the case that requires dynamic TLS. Remember that we have just two words in the GOT and by changing the first word to a function pointer, we have lost information about the module ID. To retain the information, the dynamic loader has to set the second word to a pointer to a (module ID, offset) pair allocated by malloc.

aarch64 defaults to TLS descriptors. On arm, i386 and x86-64, you can select TLS descriptors via GCC -mtls-dialect=gnu2.