Binary type inference in Ghidra

By Ian Smith

Trail of Bits is releasing BTIGhidra, a Ghidra extension that helps reverse engineers by inferring type information from binaries. The analysis is inter-procedural, propagating and resolving type constraints between functions while consuming user input to recover additional type information. This refined type information produces more idiomatic decompilation, enhancing reverse engineering comprehension. The figures below demonstrate how BTIGhidra improves decompilation readability without any user interaction:

Figure 1: Default Ghidra decompiler output

Figure 2: Ghidra output after running BTIGhidra

Precise typing information transforms odd pointer arithmetic into field accesses and void* into the appropriate structure type; introduces array indexing where appropriate; and reduces the clutter of void* casts and dereferences. While type information is essential to high-quality decompilation, the recovery of precise type information unfortunately presents a major challenge for decompilers and reverse engineers. Information about a variable’s type is spread throughout the program wherever the variable is used. For reverse engineers, it is difficult to keep a variable’s dispersed usages in their heads while reasoning about a local type. We created BTIGhidra in an effort to make this challenge a thing of the past.

A simple example

Let’s see how BTIGhidra can improve decompiler output for an example binary taken from a CTF challenge called mooosl (figure 3). (Note: Our GitHub repository has directions for using the plugin to reproduce this demo.) The target function, called lookup, iterates over nodes in a linked list until it finds a node with a matching key in a hashmap stored in list_heads.¹ This function hashes the queried key, then selects the linked list that stores all nodes that have a key equal to that hash. Next, it traverses the linked list looking for a key that is equal to the key parameter.

Figure 3: Linked-list lookup function from mooosl

The structure for linked list nodes (figure 4) is particularly relevant to this example. The structure has buffers for the key and value stored in the node, along with sizes for each buffer. Additionally, each node has a next pointer that is either null or points to the next node in the linked list.

Figure 4: Linked list node structure definition

Figure 5 shows Ghidra’s initial decompiler output for the lookup function (FUN_001014fb). The overall decompilation quality is low due to poor type information across the function. For example, the recursive pointer next in the source code causes Ghidra to emit a void** type for the local variable (local_18), and the return type. Also, the type of the key_size function parameter, referred to as param_2 in the output, is treated as a void* type despite not being loaded from. Finally, the access to the global variable that holds linked list head nodes, referred to as DAT_00104010, is not treated as an array indexing operation.

Figure 5: Ghidra decompiler output for the lookup function without type inference.
Highlighted red text is changed after running type inference.

Figure 6 shows a diff against the code in figure 5 after running BTIGhidra. Notice that the output now captures the node structure and the recursive type for the next pointer, typed as struct_for_node_0_9* instead of void**. BTIGhidra also resolves the return type to the same type. Additionally, the key_size parameter (param_2) is no longer treated as a pointer. Finally, the type of the global variable is updated to a pointer to linked list node pointers (PTR_00104040), causing Ghidra to treat the load as an array indexing operation.

Figure 6: Ghidra decompiler output for the lookup function with type inference.
Highlighted green text was added by type inference.

BTIGhidra infers types by collecting a set of subtyping constraints and then solving those constraints. Usages of known function signatures act as sources for type constraints. For instance, the call to memcmp in figure 5 results in a constraint on param_2 declaring that param2 must be a subtype size_t. Notice in the figure that BTIGhidra also successfully identifies the four fields used in this function, while also recovering the additional fields used elsewhere in the binary.

Additionally, users can supply a known function signature to provide additional type information for the type inference algorithm to propagate across the decompiled program. Figure 6 demonstrates how new type information from a known function signature (value_dump in this case) flows from a call site to the return type from the lookup function (referred to as FUN_001014fb in the decompiled output) in figure 5. The red line depicts how the user-defined function signature for value_dump is used to infer the types of field_at_8 and field_at_24 for the returned struct_for_node_0_9 from the original function FUN_001014fb. The type information derived from this call is combined with all other call sites to FUN_001014fb in order to remain conservative in the presence of polymorphism.

Figure 7: Back-propagation of type information derived from value_dump function signature

Ultimately, BTIGhidra fills in the type information for the recovered structure’s used fields, shown in figure 8. Here, we see that the types for field_at_8 and field_at_24 are inferred via the invocation of value_dump. However, the fields with type undefined8 indicate that the field was not sufficiently constrained by the added function signature to derive an atomic type for the field (i.e., there are no usages that relate the field to known type information); the inference algorithm has determined only that the field must be eight bytes.

Figure 8: Struct type information table for decompiled linked list nodes

Ghidra’s decompiler does perform some type propagation using known function signatures provided by its predefined type databases that cover common libraries such as libc. When decompiling the binary’s functions that call known library functions, these type signatures are used to guess likely types for the variables and parameters of the calling function. This approach has several limitations. Ghidra does not attempt to synthesize composite types (i.e., structs and unions) without user intervention; it is up to the user to define when and where structs are created. Additionally, this best-effort type propagation approach has limited inter-procedural power. As shown in figure 9, Ghidra’s default type inference results in conflicting types for FUN_1014fb and FUN_001013db (void* versus long and ulong), even though parameters are passed directly between the two functions.

Figure 9: Default decompiler output using Ghidra’s basic type inference

Our primary motivation for developing BTIGhidra is the need for a type inference algorithm in Ghidra that can propagate user-provided type information inter-procedurally. For such an algorithm to be useful, it should not guess a “wrong” type. If the user submits precise and correct type information, then the type inference algorithm should not derive conflicting type information that prevents user-provided types from being used. For instance, if the user provides a correct type float and we infer a type int, then these types will conflict resulting in a type error (represented formally by a bottom lattice value). Therefore, inferred types must be conservative; the algorithm should not derive a type for a program variable that conflicts with its source-level type. In a type system with subtyping, this property can be phrased more precisely as “an inferred type for a program variable should always be a supertype of the actual type of the program variable.”

In addition to support for user-provided types, BTIGhidra overcomes many other shortcomings of Ghidra’s built-in type inference algorithm. Namely, BTIGhidra can operate over stripped binaries, synthesize composite types, ingest user-provided type constraints, derive conservative typing judgments, and collect a well-defined global view of a binary’s types.

Bringing type-inference to binaries

At the source level, type inference algorithms work by collecting type constraints on program terms that are expressed in the program text, which are then solved to produce a type for each term. BTIGhidra operates on similar principles, but needs to compensate for information loss introduced by compilation and C’s permissive types. BTIGhidra uses an expressive type system that supports subtyping, polymorphism, and recursive types to reason about common programming idioms in C that take advantage of the language’s weak types to emulate these type system features. Also, subtyping, when combined with reaching definitions analysis, allows the type inference algorithm to handle compiler-introduced behavior, such as register and stack variable reuse.

Binary type inference proceeds similarly, but information lost during compilation increases the difficulty of collecting type constraints. To meet this challenge, BTIGhidra runs various flow-sensitive data-flow analyses (e.g., value-set analysis) provided by and implemented using FKIE-CAD’s cwe_checker to track how values flow between program variables. These flows inform which variables or memory objects must be subtypes of other objects. Abstractly, if a value flows from a variable x into a variable y, then we can conservatively conclude that x is a subtype of y.

Using this data-flow information, BTIGhidra independently generates subtyping constraints for each strongly connected component (SCC)² of functions in the binary’s call graph. Next, BTIGhidra simplifies signatures by using a set of proof rules to solve for all derivable relationships between interesting variables (i.e., type constants like int and size_t, functions, and global variables) within an SCC. These signatures act as a summary of the function’s typing effects when it is called. Finally, BTIGhidra solves for the type sketch of each SCC, using the signatures of called SCCs as needed.

Type sketches are our representation of recursively constrained types. They represent a type as a directed graph, with edges labeled by fields that represent the capabilities of a type and nodes labeled by a bound [lb,ub]. Figure 10 shows an example of a type sketch for the value_dump function signature. As an example, the path from node 3 to 8 can be read as “the type with ID 3 is a function that has a second in parameter which is an atomic type that is a subtype of size_t and a supertype of bottom.” These sketches provide a convenient representation of types when lowering to C types through a fairly straightforward graph traversal. Type sketches also form a lattice with a join and meet operator defined by language intersection and union, respectively. These operations are useful for manipulating types while determining the most precise polymorphic type we can infer for each function in the binary. Join allows the algorithm to determine the least supertype of two sketches, and meet allows the algorithm to determine the greatest subtype of two sketches.

Figure 10: Type sketch for the value_dump function signature

The importance of polymorphic type inference

Using a type system that supports polymorphism may seem odd for inferring C types when C has no explicit support for polymorphism. However, polymorphism is critical for maintaining conservative types in the presence of C idioms, such as handling multiple types in a function by dispatching over a void pointer. Perhaps the most canonical examples of polymorphic functions in C are malloc and free.

Figure 11: Example program that uses free polymorphically

In the example above, we consider a simple (albeit contrived) program that passes two structs to free. We access the fields of both foo and bar to reveal field information to the type inference algorithm. To demonstrate the importance of polymorphism, I modified the constraint generation phase of type inference to generate a single formal type variable for each function, rather than a type variable per call site. This change has the effect of unifying all constraints on free, regardless of the calling context.

The resulting unsound decompilation is as follows:

struct_for_node_0_13 * produce(struct_for_node_0_13
*param_1,struct_for_node_0_13 *param_2)

{
  param_1->field_at_0 = param_2->field_at_8;
  param_1->field_at_8 = param_2->field_at_0;
  param_1->field_at_16 = param_2->field_at_0;
  free(param_1);
  free(param_2);
  return param_1;
}

Figure 12: Unsound inferred type for the parameters to produce

The assumption that function calls are non-polymorphic leads to inferring an over-precise type for the function’s parameters (shown in figure 12), causing both parameters to have the same type with three fields.

Instead of unifying all call sites of a function, BTIGhidra generates a type variable per call site and unifies the actual parameter type with the formal parameter type only if the inferred type is structurally equal after a refinement pass. This conservative assumption allows BTIGhidra to remain sound and derive the two separate types for the parameters to the function in figure 11:

struct_for_node_0_16 * produce(struct_for_node_0_16
*param_1,struct_for_node_0_20 *param_2)

{
  param_1->field_at_0 = param_2->field_at_8;
  param_1->field_at_8 = param_2->field_at_0;
  param_1->field_at_16 = param_2->field_at_0;
  free(param_1);
  free(param_2);
  return param_1;
}

Evaluating BTIGhidra

Inter-procedural type inference on binaries operates over a vast set of information collected on the target program. Each analysis involved is a hard computational problem in its own right. Ghidra and our flow-sensitive analyses use heuristics related to control flow, ABI information, and other constructs. These heuristics can lead to incorrect type constraints, which can have wide-ranging effects when propagated.

Mitigating these issues requires a strong testing and validation strategy. In addition to BTIGhidra itself, we also released BTIEval, a tool for evaluating the precision of type inference on binaries with known ground-truth types. BTIEval takes a binary with debug information and compares the types recovered by BTIGhidra to those in the debug information (the debug info is ignored during type inference). The evaluation utility aggregates soundness and precision metrics. Utilizing BTIEval more heavily and over more test binaries will help us provide better correctness guarantees to users. BTIEval also collects timing information, allowing us to evaluate the performance impacts of changes.

Give BTIGhidra a try

The pre-built Ghidra plugin is located here or can be built from the source. The walkthrough instructions are helpful for learning how to run the analysis and update it with new type signatures. We look forward to getting feedback on the tool and welcome any contributions!

Acknowledgments

BTIGhidra’s underlying type inference algorithm was inspired by and is based on an algorithm proposed by Noonan et al. The methods described in the paper are patented under process patent US10423397B2 held by GrammaTech, Inc. Any opinions, findings, conclusions, or recommendations expressed in this blog post are those of the author(s) and do not necessarily reflect the views of GrammaTech, Inc.

We would also like to thank the team at FKIE-CAD behind CWE Checker. Their static analysis platform over Ghidra PCode provided an excellent base set of capabilities in our analysis.

This research was conducted by Trail of Bits based upon work supported by DARPA under Contract No. HR001121C0111 (Distribution Statement A, Approved for Public Release: Distribution Unlimited). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Government or DARPA.

¹Instructions for how to use the plugin to reproduce this demo are available here.
²A strongly connected component of a graph is a set of nodes in a directed graph where there exists a path from each node in the set to every other node in the set. Conceptually an SCC of functions separates the call graphs into groups of functions that do not recursively call each other.