-fno-semantic-interposition
2021-05-09 16:00:00 Author: maskray.me

The ELF specification says for the STV_DEFAULT visibility: "Global and weak symbols are also preemptable, that is, they may be preempted by definitions of the same name in another component." In many implementations, a defined symbol of any binding in the executable cannot be preempted, but a default visibility STB_GLOBAL or STB_WEAK symbol defined in a shared object can be preempted.

It may be a bit surprising that a defined default visibility STB_GLOBAL symbol can be preempted. In the example below, the callee (f) is defined in the same translation unit.

void f() { ... }
void g() { f(); }

GCC's interpretation is: since an object file compiled with -fpic can be linked into a shared object, the symbol f is interposable (preemptible) at runtime. By default, GCC therefore considers the definition inexact and suppresses interprocedural optimizations, including inlining.

The emitted assembly looks like:

.globl f
f:

.globl g
g:
call f@PLT # or call f; R_X86_64_PLT32

In -shared mode, the linker notices that f is preemptible and resolves the branch target to a PLT entry with a dynamic relocation R_*_JUMP_SLOT. Assuming interposition doesn't happen, the ideal behavior would be for the branch to jump to the target directly.

The combined compiler and linker behavior causes a performance cost of 5% or more. This is a feature used by 0.01% of libraries that penalizes the other 99.99%. Read on.

On PE-COFF, f cannot be interposed. On Mach-O, f can be interposed only if the dylib is linked with -interposable.

GCC -fno-semantic-interposition

GCC 5 introduced -fno-semantic-interposition to optimize -fpic code. First, GCC can apply interprocedural optimizations, including inlining, just as with -fno-pic and -fpie. Second, in the emitted assembly, a function call goes through a local alias to avoid a PLT entry when linked with -shared.

.globl f
f: # STB_GLOBAL
...
.set f.localalias, f # STB_LOCAL

g:
call f.localalias

If the branch instruction used the regular STB_GLOBAL symbol instead, the linker would notice that the default visibility f is preemptible in -shared mode and resolve the branch target to a PLT entry with a dynamic relocation R_*_JUMP_SLOT:

.globl f
f:

g:
call f@PLT # or call f

If f is only a declaration (not a definition), -fno-semantic-interposition makes no difference.
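The C sketch below spells out by hand what -fno-semantic-interposition lets GCC do automatically, which may make the local alias idea more concrete. It assumes an ELF target and a GCC or Clang that supports the alias and visibility attributes. The name f_local is hypothetical; GCC's own alias is the compiler-generated f.localalias. Because f_local has hidden visibility, it is never preemptible. A call through it is therefore bound to this definition even when the file is built with -fpic and linked with -shared.

```c
/* Default visibility STB_GLOBAL definition: in -fpic -shared mode, a call
   through the name f may go via the PLT and can be interposed. */
int f(void) { return 42; }

/* Hand-written analogue of GCC's f.localalias (the name f_local is
   hypothetical): a second symbol for the same definition. Hidden visibility
   makes it non-preemptible, so calls through it bind to this definition. */
extern __typeof__(f) f_local __attribute__((alias("f"), visibility("hidden")));

/* Like "call f.localalias" in the GCC output: no PLT, no interposition. */
int g(void) { return f_local(); }
```

This is roughly the pattern glibc uses for its internal hidden aliases. For an ordinary project, passing -fno-semantic-interposition is far less error-prone than annotating every function.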

Clang -fno-semantic-interposition

Longstanding behavior

It turns out that the first merit of the GCC feature, "interprocedural optimizations including inlining are applicable", is actually Clang's longstanding behavior for external linkage definitions in -fpic code.

When -fsemantic-interposition was contributed by Serge Guelton, I noted that we should keep the aggressive behavior, even if it differs from GCC, so as not to regress the longstanding optimizations. (ipconstprop, inliner, sccp, and sroa treat normal ExternalLinkage GlobalObjects as non-interposable.) (Before https://reviews.llvm.org/D72197, MC resolved a PC-relative VK_None fixup to a non-local symbol at assembly time, leaving no outstanding relocation, if the target was defined in the same section. Put simply, even if IR optimizations failed to optimize and allowed interposition for the function call in void foo() {} void bar() { foo(); }, the assembler would disallow it.)

If a project really requires symbol interposition to work (extremely rare), it may be unhappy with Clang's default behavior. The project should specify -fsemantic-interposition to disable interprocedural optimizations.

dso_local inference in -fpic -fno-semantic-interposition mode

I contributed an optimization to Clang 11: in -fpic -fno-semantic-interposition mode, default visibility external linkage definitions get the dso_local specifier, like in -fno-pic and -fpie modes. For -fpic code, accesses to a dso_local symbol will go through its local alias .Lfoo$local. For -fno-pic and -fpie code, accesses to a dso_local symbol can use the original foo because the object file shall not be linked with -shared.

With dso_local, there are some noticeable behavior differences:

  • variable access: access the local symbol directly instead of going through a GOT indirection
  • function call: call .Lfoo$local
  • taking the address of a function: similar to a variable access
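The translation unit below is a hypothetical test case that exercises all three kinds of access listed above. The names counter, bump, bump_twice and bump_address are made up. One way to observe the difference is to compare the x86-64 assembly from `clang -fpic -S` with and without -fno-semantic-interposition. The comments state what the list above predicts for the -fno-semantic-interposition build; they are not verified compiler output. At higher optimization levels, bump may also be inlined into bump_twice, which would hide the call.

```c
/* Default visibility, external linkage definitions: candidates for
   dso_local under -fpic -fno-semantic-interposition. */
int counter = 0;

/* Variable access: expected to be PC-relative instead of via the GOT. */
void bump(void) { counter++; }

/* Function call: expected to target .Lbump$local instead of bump@PLT. */
void bump_twice(void) {
  bump();
  bump();
}

/* Taking a function's address: handled like a variable access. */
void (*bump_address(void))(void) { return bump; }
```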

For the previous C example, the emitted assembly will look like the following.

.globl f
f: # STB_GLOBAL
.Lf$local: # STB_LOCAL
...
g:
call .Lf$local

The local alias is a .L symbol. This is deliberate:

  • The assembler suppresses the symbol table entry and converts relocations to reference the section symbol instead.
  • Tools cannot be confused by two symbols at the same location. For a GCC-produced object file, llvm-objdump currently names the function f.localalias.

This behavior change causes the branch target symbol to have a different type: STT_FUNC -> STT_NOTYPE. In some processor supplementary ABIs, this may have implications for range extension thunks. ABI makers should be aware of this.

GCC doesn't optimize global variable accesses: they still go through the GOT. Feature request: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100483. This rarely matters for performance, though.

In Clang cc1, there are three states:

  • -fsemantic-interposition: this represents -fpic -fsemantic-interposition. Don't set dso_local on default visibility external linkage definitions. Emit a module flag metadata SemanticInterposition to disallow interprocedural optimizations.
  • -fhalf-no-semantic-interposition: this represents -fpic without a semantic interposition option. Don't set dso_local on default visibility external linkage definitions. However, interprocedural optimizations on such definitions are allowed.
  • (default): this represents any of -fno-pic, -fpie, and -fpic -fno-semantic-interposition. Set dso_local on default visibility external linkage definitions. Interprocedural optimizations on such definitions are allowed.
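The three cc1 states can be read as a small decision table. The sketch below is a hypothetical model written for illustration only. The enum and function names are not Clang's internal identifiers, and the model encodes nothing beyond the mapping in the list above. In particular, it assumes the semantic interposition options only matter together with -fpic.

```c
#include <stdbool.h>

/* Hypothetical names; they only mirror the driver options in the text. */
enum pic_mode { NO_PIC, PIE, PIC };
enum si_option { SI_UNSPECIFIED, SI_ON, SI_OFF }; /* -f[no-]semantic-interposition */
enum cc1_state { CC1_SEMANTIC_INTERPOSITION, CC1_HALF_NO_SI, CC1_DEFAULT };

/* Map driver options to one of the three cc1 states. */
enum cc1_state cc1_state_for(enum pic_mode pic, enum si_option si) {
  if (pic != PIC)
    return CC1_DEFAULT;                /* -fno-pic and -fpie */
  if (si == SI_ON)
    return CC1_SEMANTIC_INTERPOSITION; /* -fpic -fsemantic-interposition */
  if (si == SI_OFF)
    return CC1_DEFAULT;                /* -fpic -fno-semantic-interposition */
  return CC1_HALF_NO_SI;               /* plain -fpic */
}

/* Do default visibility external linkage definitions get dso_local? */
bool sets_dso_local(enum cc1_state s) { return s == CC1_DEFAULT; }

/* Are interprocedural optimizations on such definitions allowed? */
bool allows_ipo(enum cc1_state s) { return s != CC1_SEMANTIC_INTERPOSITION; }
```

In this model, the half state is the only one in which the two predicates disagree: no dso_local, but interprocedural optimizations are still allowed. That mismatch is exactly the state the later section asks about dropping.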

Targets

As of Clang 12, -fno-semantic-interposition is only effective on x86.

Hopefully, this optimization will be available on AArch64 (https://reviews.llvm.org/D101873) and RISC-V (https://reviews.llvm.org/D101876).

ThinLTO

I believe -fpic -fno-semantic-interposition works with ThinLTO :) ThinLTO required two changes: if a GlobalVariable is converted to a declaration, the dso_local specifier should be dropped (D74749, D74751).

Space overhead

In GCC's foo.localalias scheme, there is an extra symbol table entry (sizeof(Elf64_Sym) = 24 bytes) and a string in the string table.

In Clang's .Lfoo$local scheme, this generally costs an STT_SECTION symbol table entry (the entry can usually be suppressed).
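The GCC-side cost can be estimated per aliased function. The sketch assumes a 64-bit ELF object and a system that provides <elf.h>, such as glibc. The function name localalias_overhead is hypothetical, and the estimate ignores string table deduplication and any section-level overhead.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Approximate bytes added by GCC's NAME.localalias scheme for one function:
   one extra Elf64_Sym (24 bytes) plus "NAME.localalias" and its NUL
   terminator in .strtab. Simplified: no string sharing is assumed. */
size_t localalias_overhead(const char *name) {
  static const char suffix[] = ".localalias";
  return sizeof(Elf64_Sym) + strlen(name) + (sizeof suffix - 1) + 1;
}
```

For a function named f, this gives 24 + 13 = 37 bytes, since "f.localalias" is 12 characters plus a NUL terminator.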

Can Clang default to -fno-semantic-interposition?

Clang currently has three states. There are some optimization opportunities between the half state and the full -fno-semantic-interposition. It is natural to ask whether we can drop the half state and make -fpic default to the full state.

This is something I'd like Clang to do, but I'll note that there is still some risk. Here are some points favoring the changed default.

Interprocedural optimizations (including inlining)

For ELF -fpic, Clang has by default never suppressed interprocedural optimizations (including inlining) on default visibility external linkage definitions. So projects relying on these optimizations being blocked have been broken for years. They have probably only worked recently, by specifying -fsemantic-interposition.

Assembler behavior for VK_None

.globl f
f:
ret

.globl g
g:
bl f # VK_None
ret

Before 2020-01 (https://reviews.llvm.org/D72197), the integrated assembler resolved the fixup when the target symbol and the location were in the same section. There was no relocation, so the linker could not produce a PLT entry.

Non-x86 targets typically use VK_None for branch instructions. x86 uses VK_PLT for -fpie and -fpic.

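If a project worked (e.g. its tests passed) with -fno-function-sections on aarch64/ppc/etc. before 2020-01, we have some confidence that it does not rely on function semantic interposition. Most functions of a translation unit then shared the .text section, so calls between them were resolved at assembly time and could not be interposed.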

Difference from -fvisibility=protected

A non-default visibility symbol cannot be preempted, even if the binding is STB_WEAK. -fvisibility=protected can make a weak definition protected. If you want a weak definition to be preemptible, you may need __attribute__((weak,visibility("default"))), which is verbose and error-prone.
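Here is a sketch of the verbose form. It assumes the file is compiled with -fvisibility=protected, and the macro and function names are hypothetical. Wrapping the attribute in a macro at least keeps the annotation in one place.

```c
/* Assumed build: cc -fpic -fvisibility=protected -c hooks.c
   The explicit visibility attribute overrides -fvisibility=protected, so the
   weak definition below keeps STV_DEFAULT and remains preemptible. */
#define PREEMPTIBLE_WEAK __attribute__((weak, visibility("default")))

/* Fallback implementation; another component may provide its own plugin_hook. */
PREEMPTIBLE_WEAK int plugin_hook(void) { return 0; }

/* Gets STV_PROTECTED under -fvisibility=protected, so it cannot be preempted. */
int other_helper(void) { return 1; }
```

The macro does not remove the underlying risk: forgetting it on one weak definition silently makes that definition protected.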

In terms of symbol binding, ld -shared -Bsymbolic is very similar to -pie. -Bsymbolic can subsume some optimizations of -fno-semantic-interposition:

  • variable access: on x86-64, with R_X86_64_GOTPCRELX/R_X86_64_REX_GOTPCRELX, the GOT indirection can be suppressed. However, the code sequence is still longer than one that never used the GOT. On PowerPC64, there is a similar TOC optimization. On other architectures, -Bsymbolic makes no difference here.
  • function call: call foo@PLT will not create a PLT entry.

-Bsymbolic-functions only applies to STT_FUNC symbols and is generally safer than -Bsymbolic. The main problem with -Bsymbolic is that it doesn't work with copy relocations. (-Bsymbolic can lead to multiple type info objects, but that actually works because libsupc++/libc++abi does (cough) string comparison.)

LD_PRELOAD

There are several types of LD_PRELOAD usage.

First, use LD_PRELOAD=same_soname.so to replace a DT_NEEDED entry with the same SONAME. Both -fno-semantic-interposition and -Bsymbolic are compatible with such usage.

Second, use LD_PRELOAD=malloc.so to intercept some functions not defined in the application or any of its shared object dependencies. Both -fno-semantic-interposition and -Bsymbolic are compatible.

void *f() { return malloc(0xb612); }

Third, use LD_PRELOAD=different_soname.so to replace a function defined in a shared object dependency whose SONAME is different. Such usage is incompatible with -Bsymbolic. If the function is referenced in its defining translation unit, those call sites are statically bound with -fno-semantic-interposition; otherwise the usage is still compatible.
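The LD_PRELOAD cases can be illustrated with a toy model of ELF symbol lookup. This is a deliberately simplified sketch, not how ld.so is implemented:

- Components are searched in lookup-scope order: the executable, then LD_PRELOAD objects, then dependencies.
- The first definition found wins.
- A component marked bind_locally resolves its own references to its own definitions. This flag stands in for -Bsymbolic, or for call sites statically bound under -fno-semantic-interposition.

The struct, field and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a loaded ELF component (hypothetical, simplified). */
struct component {
  const char *soname;
  const char *const *defs; /* NULL-terminated list of defined symbols */
  bool bind_locally;       /* models -Bsymbolic or statically bound call sites */
};

static bool defines(const struct component *c, const char *sym) {
  for (const char *const *d = c->defs; *d; ++d)
    if (strcmp(*d, sym) == 0)
      return true;
  return false;
}

/* Index of the component whose definition a reference from component `from`
   to `sym` binds to, or -1 if no component defines it. */
int resolve(const struct component *scope, int n, int from, const char *sym) {
  if (scope[from].bind_locally && defines(&scope[from], sym))
    return from; /* bound to its own definition: cannot be interposed */
  for (int i = 0; i < n; ++i)
    if (defines(&scope[i], sym))
      return i; /* first definition in lookup order wins */
  return -1;
}
```

Take scope = {a.out, different_soname.so, libfoo.so}, where both shared objects define foo. A reference to foo from libfoo.so resolves to different_soname.so by default, but to libfoo.so once bind_locally is set. A reference from a.out reaches the preloaded foo in either case, which matches the "otherwise the usage is still compatible" remark above.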

Applications

Python. CPython 3.10 sets -fno-semantic-interposition. The Red Hat article "Red Hat Enterprise Linux 8.2 brings faster Python 3.8 run speeds" reports a huge improvement (up to 30%). This is really an upper bound of what you can see in real-world applications. I think this actually suggests some code problems in CPython.

A small single-digit performance boost (say, 4%) is what I'd normally expect. In https://bugs.archlinux.org/task/70697 and https://bugzilla.redhat.com/show_bug.cgi?id=1956484, I suggest that distributions consider -fno-semantic-interposition and -Bsymbolic-functions when building Clang.


Source: http://maskray.me/blog/2021-05-09-fno-semantic-interposition