Toolchain testing
2021-08-08

Classification

Testing levels.

Unit testing

Test an individual function, class, or module.

In llvm-project, unittests/ directories (e.g. llvm/unittests/ and clang/unittests/) contain such tests.

  • A homebrew container has interfaces similar to a standard library container.
  • The target triple parser can handle a triple.
  • An IR mutator performs the expected operation.
  • The source file formatter can handle some JavaScript syntax.
  • A code completer offers the expected completions.
  • A semantic analysis function can constant evaluate an expression.

Unit testing typically has a high setup cost for the environment.
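
To make this concrete, here is a minimal sketch of what such a test looks like, modeled on the gtest-based tests under llvm/unittests/ (the test name and assertions are illustrative, not copied from the tree; the header path matches llvm-project around the time of writing):

#include "llvm/ADT/Triple.h"
#include "gtest/gtest.h"

using namespace llvm;

// Check that the target triple parser decomposes a common triple into
// the expected architecture, vendor, OS, and environment components.
TEST(TripleTest, ParsesLinuxTriple) {
  Triple T("x86_64-unknown-linux-gnu");
  EXPECT_EQ(Triple::x86_64, T.getArch());
  EXPECT_EQ(Triple::UnknownVendor, T.getVendor());
  EXPECT_EQ(Triple::Linux, T.getOS());
  EXPECT_EQ(Triple::GNU, T.getEnvironment());
}

The setup cost is low here because llvm::Triple is self-contained; a unit test for, say, a semantic analysis function first needs to construct an AST context, which is where the setup cost comes from.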

Integration testing

Multiple modules are combined and tested as a group.

With good modularity, many modules have intuitive and expected behaviors. Many integration tests may look similar to unit tests.

In llvm-project, many lit tests belong to this category (a sketch follows the list). They may test:

  • the Clang driver can detect a GCC installation.
  • a Clang driver option passes an option to cc1.
  • a cc1 option can cause Clang's IR generator to emit something.
  • an instrumentation pass can instrument a particular function.
  • an optimization pass performs the expected transformation.
  • the code generator can translate an LLVM IR instruction.
  • the code generator can optimize a machine instruction in a certain way.
  • the integrated assembler can parse an instruction and encode it correctly.
  • a linker option performs the expected operation.
  • readelf can dump a section in an object file.
  • a debugger command performs the expected operation.
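
As a sketch of the "cc1 option causes the IR generator to emit something" flavor, a FileCheck-based lit test is typically a source file with embedded RUN and CHECK lines. This file is hypothetical; %clang_cc1 and %s are standard lit substitutions:

// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm -o - %s | FileCheck %s

// CHECK-LABEL: define {{.*}}i32 @add(
// CHECK: add nsw i32
int add(int a, int b) { return a + b; }

lit runs each RUN line as a shell command and FileCheck matches the patterns against the compiler output, so the test doubles as documentation of the expected behavior.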

System testing

Test the entire system. I consider runtime tests to belong to this category because they exercise many modules of the system (a sketch follows the list).

  • An environment variable can change a sanitizer behavior.
  • A C++ source file can be compiled with debug information coverage above a certain threshold.
  • A pthread mutex implementation behaves correctly in a case of three threads.
  • A C source file with a certain floating point function has the expected output.
  • A linker can link Clang.
  • A Clang built by GCC can build itself.
  • A 2-stage Clang is identical to a 3-stage Clang (build reproducibility).
  • A debugger can handle some C++ syntax emitted by a compiler.
  • A build system supports a particular build configuration.
  • A piece of software runs on a particular OS.
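
As a sketch of the first bullet, a compiler-rt style runtime test compiles and runs a program under a sanitizer. The RUN lines below follow compiler-rt lit conventions (%run, not, env), though the exact substitutions vary between sanitizer test suites:

// RUN: %clang -fsanitize=address -g %s -o %t
// By default LeakSanitizer (bundled with ASan on Linux) reports the leak
// and the program exits with a nonzero code.
// RUN: not %run %t 2>&1 | FileCheck %s
// An environment variable changes the runtime behavior: leak detection
// is disabled and the program exits successfully.
// RUN: env ASAN_OPTIONS=detect_leaks=0 %run %t

#include <stdlib.h>

int main() {
  malloc(7); // Intentionally leaked.
  // CHECK: detected memory leaks
  return 0;
}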

Another way of classifying tests.

Regression testing

A bug is fixed or a behavior is changed. Can we develop a test to distinguish the new state from the previous state?

Performance testing

Measure the build speed, intermediate file sizes (usually object files), latency, throughput, time/memory/disk usage, CPU/network utilization, etc.

Many metrics can be refined to finer granularity (e.g. the size of one section instead of the whole object file).

Oh, how is network relevant in toolchain testing? Think of a debug client talking to a debug server.

Common problems

I don't know whether a test is needed

When in doubt, the answer is usually yes. Bugfixes and behavior changes are usually interesting.

I don't know where to add a test

The documentation should make it clear how to run the entire testsuite. A good continuous integration system can improve testsuite discoverability.

Many projects have a significant barrier to entry because they require a very specific environment which is difficult to set up. Some not-so-good situations:

  • you need to install XXX container technology.
  • if you use Debian, it is difficult to install XXX.
  • you need to install A, B, and C in a virtual machine.
  • if you test on Debian, you likely get more failures than on Fedora.

In llvm-project, if you change llvm/, run ninja check-llvm; if you change clang/, run ninja check-clang. Some cross-top-level-directory dependencies may be less clear. If you remove a function from clang/include/clang/ or change a function signature, you may need ninja check-clang-tools. If you change optimization, debug information generation, or code generation, you may need ninja check-clang check-flang for a small number of code generation tests.

I don't know whether an existing test can be reused

Adding a new file is much easier than finding the right file to extend. A new contributor may not have the patience to understand the test organization or read existing tests. A maintainer can suggest the right file to extend, if reusing a file is the right call.

When can reusing an existing test be better?

  • The file also exercises the behavior but just misses some behavior checks.
  • The changed behavior is related to an option which, placed side by side with another set of invocations, can improve readability/discoverability/etc.

Sometimes an existing test turns out to be insufficient. Adding a generalized test can enable deletion of existing tests.

The test checks too little

A reader may have difficulty understanding the intention of the test.

The test may not reliably test the behavior change or bug fix. It may become stale and irrelevant as soon as the modified functions/modules are slightly changed.

Scroll down for antipatterns in some GNU toolchain projects.

The test checks too much

It can harm readability.

The test may require frequent updates for quite unrelated changes. The person making such an unrelated change then wastes a few seconds or minutes updating the test.
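
To illustrate both failure modes with the hypothetical @add lit test from earlier: the first pair of CHECK lines pins down exactly the property under test, while the second block (the kind an unedited autogenerated test produces) encodes incidental details that force an update whenever value names, attributes, or alignment change:

// Checks just enough: the function is emitted and uses a signed add.
// CHECK-LABEL: define {{.*}}i32 @add(
// CHECK: add nsw i32

// Checks too much: value names, parameter attributes, and attribute
// group #0 are irrelevant to the behavior under test and churn often.
// CHECK: define dso_local i32 @add(i32 noundef %a, i32 noundef %b) #0 {
// CHECK-NEXT: entry:
// CHECK-NEXT: %a.addr = alloca i32, align 4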

The test checks at the wrong layer

Examples

binutils-gdb

In binutils/testsuite/lib/binutils-common.exp, run_dump_test is the core utility: it runs as or ld, then checks that the output of a dump program matches the expected patterns in a .d file.

Let's take a whirlwind tour and figure out the numerous problems.

First, test discoverability. AFAIK, make check-ld, make check-gas, and make check-binutils are not documented. make check runs many tests which a contributor may not care about.

Many test files do not have descriptive names. For example, there are ld/testsuite/ld-elf/group{1,2,3a,3b,4,5,6,7,8a,8b,9a,9b,10,12}.d. We can tell from the filenames that they are related to section groups, but we cannot tell what an individual file tests.

If you open an arbitrary group*.d file, it is difficult to discern its intention. Either there is no comment, or the comment just explains why the test is excluded on some targets.

% cat ld/testsuite/ld-elf/group2.d
#source: ../../../binutils/testsuite/binutils-all/group.s
#ld: -r
#readelf: -Sg

#...
  \[[ 0-9]+\] \.group[ \t]+GROUP[ \t]+.*
#...
  \[[ 0-9]+\] \.text.*[ \t]+PROGBITS[ \t0-9a-f]+AXG.*
#...
  \[[ 0-9]+\] \.data.*[ \t]+PROGBITS[ \t0-9a-f]+WAG.*
#...
COMDAT group section \[[ 0-9]+\] `\.group' \[foo_group\] contains . sections:
   \[Index\]    Name
   \[[ 0-9]+\]   .text.*
#...
   \[[ 0-9]+\]   .data.*
#pass

Unfortunately, git log gives very little help here, because many commits touching these files do not document their intention in the commit messages.

21
commit 6a0d0afdc7ca5c7e0605ede799e994c98d596644
Author: hidden
Date: Thu Oct 20 10:06:41 2005

binutils/testsuite/

2005-10-20 hidden

PR ld/251
* binutils-all/group.s: New file.

* binutils-all/objcopy.exp (objcopy_test_readelf): New
procedure.
Use it to test ELF group.

ld/testsuite/

2005-10-20 hidden

PR ld/251
* ld-elf/group.2d: New file.

Now you probably see more problems. There is no way to run multiple ld invocations in one file. If you want to alter the command line options a bit, you have to create a new test file and add some lines to an .exp file.

The test uses \[[ 0-9]+\] to match section indexes. To utilize the test better, you may want to check that the section group's member indexes match their indexes in the section header table. Unfortunately, run_dump_test provides no regex capture functionality.
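
For comparison, FileCheck in llvm-project supports exactly this: [[NAME:regex]] captures a string and a later [[NAME]] requires the same string, so the "group members match the section header table" property can be expressed directly. A hypothetical sketch of such a check:

# CHECK: [ [[TEXT:[0-9]+]]] .text{{.*}}AXG
# CHECK: COMDAT group section {{.*}} `.group' [foo_group] contains 2 sections:
# CHECK: [ [[TEXT]]] .text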

You probably noticed that some readelf tests are somehow placed in the ld/testsuite directory.

check-ld is slow. How do I run just one test file? There is an obscure way: make check-ld RUNTESTFLAGS=ld-elf/shared.exp.

