Binary type inference in Ghidra

Trail of Bits is releasing BTIGhidra, a Ghidra extension that helps reverse engineers by inferring type information from binaries. The analysis is inter-procedural, propagating and resolving type constraints between functions while consuming user input to recover additional type information. This refined type information produces more idiomatic decompilation, enhancing reverse engineering comprehension. The figures below demonstrate how BTIGhidra improves decompilation readability without any user interaction:
Trail of Bits正在发布BTIGhidra,这是一个Ghidra扩展,通过从二进制文件中推断类型信息来帮助逆向工程师。分析是过程间的,在函数之间传播和解析类型约束,同时消耗用户输入以恢复附加的类型信息。这种细化的类型信息产生了更符合习惯的反编译,增强了逆向工程的理解。下图展示了BTIGhidra如何在没有任何用户交互的情况下提高反编译的可读性:

Figure 1: Default Ghidra decompiler output
图1:默认Ghidra反编译器输出

Figure 2: Ghidra output after running BTIGhidra
图2:运行BTIGhidra后的Ghidra输出

Precise typing information transforms odd pointer arithmetic into field accesses and void* into the appropriate structure type; introduces array indexing where appropriate; and reduces the clutter of void* casts and dereferences. While type information is essential to high-quality decompilation, the recovery of precise type information unfortunately presents a major challenge for decompilers and reverse engineers. Information about a variable’s type is spread throughout the program wherever the variable is used. For reverse engineers, it is difficult to keep a variable’s dispersed usages in their heads while reasoning about a local type. We created BTIGhidra in an effort to make this challenge a thing of the past.
精确的类型信息将奇数指针运算转换为字段访问,将 void* 转换为适当的结构类型;在适当的地方引入数组索引;并减少 void* 转换和解引用的混乱。虽然类型信息对于高质量的反编译是必不可少的,但不幸的是,精确类型信息的恢复对反编译器和逆向工程师来说是一个重大挑战。有关变量类型的信息分布在程序中任何使用该变量的地方。对于逆向工程师来说,在推理局部类型时,很难在头脑中保持变量的分散用法。我们创建BTIGhidra是为了让这个挑战成为过去。

A simple example 一个简单的例子

Let’s see how BTIGhidra can improve decompiler output for an example binary taken from a CTF challenge called mooosl (figure 3). (Note: Our GitHub repository has directions for using the plugin to reproduce this demo.) The target function, called lookup, iterates over nodes in a linked list until it finds a node with a matching key in a hashmap stored in list_heads.1 This function hashes the queried key, then selects the linked list that stores all nodes that have a key equal to that hash. Next, it traverses the linked list looking for a key that is equal to the key parameter.
让我们来看看BTIGhidra如何改进反编译器的输出,这个输出来自一个名为mooosl的CTF挑战的二进制代码示例(图3)。(Note:我们的GitHub存储库有使用插件复制此演示的说明。目标函数名为 lookup ,迭代链表中的节点,直到它在存储在 list_heads 中的散列表中找到一个具有匹配键的节点。#2 #此函数对查询的键进行散列,然后选择存储具有等于该散列的键的所有节点的链表。接下来,它遍历链表,查找与key参数相等的键。

Figure 3: Linked-list lookup function from mooosl
图3:mooosl的链接列表查找函数

The structure for linked list nodes (figure 4) is particularly relevant to this example. The structure has buffers for the key and value stored in the node, along with sizes for each buffer. Additionally, each node has a next pointer that is either null or points to the next node in the linked list.
链表节点的结构(图4)与本例特别相关。该结构具有用于存储在节点中的键和值的缓冲区,沿着每个缓冲区的大小。此外,每个节点都有一个下一个指针,该指针要么为空,要么指向链表中的下一个节点。

Figure 4: Linked list node structure definition
图4:链表节点结构定义

Figure 5 shows Ghidra’s initial decompiler output for the lookup function (FUN_001014fb). The overall decompilation quality is low due to poor type information across the function. For example, the recursive pointer next in the source code causes Ghidra to emit a void** type for the local variable (local_18), and the return type. Also, the type of the key_size function parameter, referred to as param_2 in the output, is treated as a void* type despite not being loaded from. Finally, the access to the global variable that holds linked list head nodes, referred to as DAT_00104010, is not treated as an array indexing operation.
图5显示了Ghidra对查找函数( FUN_001014fb )的初始反编译器输出。由于整个函数的类型信息很差,因此总体反编译质量很低。例如,源代码中的递归指针next会导致Ghidra为局部变量( local_18 )发出 void** 类型和返回类型。此外, key_size 函数参数的类型(在输出中称为 param_2 )被视为void* 类型,尽管没有从中加载。最后,对保存链表头节点的全局变量(称为 DAT_00104010 )的访问不被视为数组索引操作。

Figure 5: Ghidra decompiler output for the lookup function without type inference.
图5:没有类型推断的查找函数的Ghidra反编译器输出。

Highlighted red text is changed after running type inference.
在运行类型推断后,突出显示的红色文本将发生更改。

Figure 6 shows a diff against the code in figure 5 after running BTIGhidra. Notice that the output now captures the node structure and the recursive type for the next pointer, typed as struct_for_node_0_9* instead of void**. BTIGhidra also resolves the return type to the same type. Additionally, the key_size parameter (param_2) is no longer treated as a pointer. Finally, the type of the global variable is updated to a pointer to linked list node pointers (PTR_00104040), causing Ghidra to treat the load as an array indexing operation.
图6显示了运行BTIGhidra后与图5中代码的差异。注意,输出现在捕获了节点结构和下一个指针的递归类型,类型为 struct_for_node_0_9* 而不是 void** 。BTIGhidra也将返回类型解析为相同的类型。此外, key_size 参数( param_2 )不再被视为指针。最后,全局变量的类型被更新为指向链表节点指针的指针( PTR_00104040 ),导致Ghidra将加载视为数组索引操作。

Figure 6: Ghidra decompiler output for the lookup function with type inference.
图6:带有类型推断的查找函数的Ghidra反编译器输出。

Highlighted green text was added by type inference.
突出显示的绿色文本是通过类型推断添加的。

BTIGhidra infers types by collecting a set of subtyping constraints and then solving those constraints. Usages of known function signatures act as sources for type constraints. For instance, the call to memcmp in figure 5 results in a constraint on param_2 declaring that param2 must be a subtype size_t. Notice in the figure that BTIGhidra also successfully identifies the four fields used in this function, while also recovering the additional fields used elsewhere in the binary.
BTIGhidra通过收集一组子类型约束然后求解这些约束来推断类型。已知函数签名的用法充当类型约束的来源。例如,图5中对 memcmp 的调用导致对 param_2 的约束,声明 param2 必须是子类型 size_t 。请注意,在图中,BTIGhidra还成功地识别了该函数中使用的四个字段,同时还恢复了二进制文件中其他地方使用的其他字段。

Additionally, users can supply a known function signature to provide additional type information for the type inference algorithm to propagate across the decompiled program. Figure 6 demonstrates how new type information from a known function signature (value_dump in this case) flows from a call site to the return type from the lookup function (referred to as FUN_001014fb in the decompiled output) in figure 5. The red line depicts how the user-defined function signature for value_dump is used to infer the types of field_at_8 and field_at_24 for the returned struct_for_node_0_9 from the original function FUN_001014fb. The type information derived from this call is combined with all other call sites to FUN_001014fb in order to remain conservative in the presence of polymorphism.
此外,用户可以提供已知的函数签名,为类型推断算法提供额外的类型信息,以便在反编译程序中传播。图6演示了来自已知函数签名(在本例中为 value_dump )的新类型信息如何从调用站点流向图5中查找函数(在反编译输出中称为 FUN_001014fb )的返回类型。红线描述了如何使用 value_dump 的用户定义函数签名来推断从原始函数 FUN_001014fb 返回的 struct_for_node_0_9 的 field_at_8 和 field_at_24 类型。从该调用中得到的类型信息与 FUN_001014fb 的所有其他调用位点组合,以便在多态性存在时保持保守。

Figure 7: Back-propagation of type information derived from value_dump function signature
图7:从value_dump函数签名派生的类型信息的反向传播

Ultimately, BTIGhidra fills in the type information for the recovered structure’s used fields, shown in figure 8. Here, we see that the types for field_at_8 and field_at_24 are inferred via the invocation of value_dump. However, the fields with type undefined8 indicate that the field was not sufficiently constrained by the added function signature to derive an atomic type for the field (i.e., there are no usages that relate the field to known type information); the inference algorithm has determined only that the field must be eight bytes.
最后,BTIGhidra为恢复的结构的used字段填充类型信息,如图8所示。在这里,我们看到 field_at_8 和 field_at_24 的类型是通过调用 value_dump 推断出来的。然而,具有类型 undefined8 的字段指示该字段未被添加的函数签名充分约束以导出该字段的原子类型(即,没有将字段与已知类型信息相关的用法);推断算法仅确定字段必须是八个字节。

Figure 8: Struct type information table for decompiled linked list nodes
图8:反编译链表节点的结构类型信息表

Ghidra’s decompiler does perform some type propagation using known function signatures provided by its predefined type databases that cover common libraries such as libc. When decompiling the binary’s functions that call known library functions, these type signatures are used to guess likely types for the variables and parameters of the calling function. This approach has several limitations. Ghidra does not attempt to synthesize composite types (i.e., structs and unions) without user intervention; it is up to the user to define when and where structs are created. Additionally, this best-effort type propagation approach has limited inter-procedural power. As shown in figure 9, Ghidra’s default type inference results in conflicting types for FUN_1014fb and FUN_001013db (void* versus long and ulong), even though parameters are passed directly between the two functions.
Ghidra的反编译器确实使用其预定义类型数据库提供的已知函数签名来执行一些类型传播,这些数据库覆盖了常见的库,如 libc 。当反编译调用已知库函数的二进制函数时,这些类型签名用于猜测调用函数的变量和参数的可能类型。这种方法有几个局限性。Ghidra并不试图合成复合类型(即,结构和联合),而无需用户干预;由用户来定义何时以及在何处创建结构。此外,这种尽力而为类型的传播方法具有有限的过程间能力。如图9所示,Ghidra的默认类型推断导致 FUN_1014fb 和 FUN_001013db 的类型冲突( void* 与 long 和 ulong ),即使参数直接在两个函数之间传递。

Figure 9: Default decompiler output using Ghidra’s basic type inference
图9:使用Ghidra的基本类型推断的默认反编译器输出

Our primary motivation for developing BTIGhidra is the need for a type inference algorithm in Ghidra that can propagate user-provided type information inter-procedurally. For such an algorithm to be useful, it should not guess a “wrong” type. If the user submits precise and correct type information, then the type inference algorithm should not derive conflicting type information that prevents user-provided types from being used. For instance, if the user provides a correct type float and we infer a type int, then these types will conflict resulting in a type error (represented formally by a bottom lattice value). Therefore, inferred types must be conservative; the algorithm should not derive a type for a program variable that conflicts with its source-level type. In a type system with subtyping, this property can be phrased more precisely as “an inferred type for a program variable should always be a supertype of the actual type of the program variable.”
我们开发BTIGhidra的主要动机是需要一个类型推理算法在Ghidra,可以传播用户提供的类型信息inter-procedurally。为了使这样的算法有用,它不应该猜测“错误”的类型。如果用户提交了精确和正确的类型信息,则类型推断算法不应派生出阻止使用用户提供的类型的冲突类型信息。例如,如果用户提供了正确的类型 float ,而我们推断出类型 int ,那么这些类型将冲突,导致类型错误(形式上由底部网格值表示)。因此,推断类型必须是保守的;算法不应为与其源级别类型冲突的程序变量派生类型。在具有子类型的类型系统中,这个属性可以更精确地表述为“程序变量的推断类型应该总是程序变量的实际类型的超类型”。

In addition to support for user-provided types, BTIGhidra overcomes many other shortcomings of Ghidra’s built-in type inference algorithm. Namely, BTIGhidra can operate over stripped binaries, synthesize composite types, ingest user-provided type constraints, derive conservative typing judgments, and collect a well-defined global view of a binary’s types.
除了支持用户提供的类型之外,BTIGhidra还克服了Ghidra内置类型推理算法的许多其他缺点。也就是说,BTIGhidra可以在剥离的二进制文件上操作,合成复合类型,摄取用户提供的类型约束,导出保守的类型判断,并收集二进制文件类型的定义良好的全局视图。

Bringing type-inference to binaries
将类型推断引入二进制文件

At the source level, type inference algorithms work by collecting type constraints on program terms that are expressed in the program text, which are then solved to produce a type for each term. BTIGhidra operates on similar principles, but needs to compensate for information loss introduced by compilation and C’s permissive types. BTIGhidra uses an expressive type system that supports subtyping, polymorphism, and recursive types to reason about common programming idioms in C that take advantage of the language’s weak types to emulate these type system features. Also, subtyping, when combined with reaching definitions analysis, allows the type inference algorithm to handle compiler-introduced behavior, such as register and stack variable reuse.
在源代码级别,类型推理算法通过收集程序文本中表达的程序术语的类型约束来工作,然后求解这些类型约束以产生每个术语的类型。BTIGhidra基于类似的原则运行,但需要补偿由编译和C的许可类型引入的信息丢失。BTIGhidra使用一个支持子类型、多态和递归类型的表达型类型系统来推理C中的常见编程习惯,这些习惯利用语言的弱类型来模拟这些类型系统功能。此外,子类型与到达定义分析相结合时,允许类型推断算法处理编译器引入的行为,例如寄存器和堆栈变量重用。

Binary type inference proceeds similarly, but information lost during compilation increases the difficulty of collecting type constraints. To meet this challenge, BTIGhidra runs various flow-sensitive data-flow analyses (e.g., value-set analysis) provided by and implemented using FKIE-CAD’s cwe_checker to track how values flow between program variables. These flows inform which variables or memory objects must be subtypes of other objects. Abstractly, if a value flows from a variable x into a variable y, then we can conservatively conclude that x is a subtype of y.
二进制类型推断的过程类似,但在编译期间丢失的信息增加了收集类型约束的难度。为了应对这一挑战,BTIGhidra运行各种流敏感的数据流分析(例如,使用FKIE-CAD的cwe_checker提供并实现的值集分析(value-set analysis),用于跟踪值如何在程序变量之间流动。这些流通知哪些变量或内存对象必须是其他对象的子类型。抽象地说,如果一个值从变量x流入变量y,那么我们可以保守地得出结论, x 是 y 的子类型。

Using this data-flow information, BTIGhidra independently generates subtyping constraints for each strongly connected component (SCC)2 of functions in the binary’s call graph. Next, BTIGhidra simplifies signatures by using a set of proof rules to solve for all derivable relationships between interesting variables (i.e., type constants like int and size_t, functions, and global variables) within an SCC. These signatures act as a summary of the function’s typing effects when it is called. Finally, BTIGhidra solves for the type sketch of each SCC, using the signatures of called SCCs as needed.
使用这个数据流信息,BTIGhidra独立地为二进制调用图中的每个强连接组件(SCC) 2 函数生成子类型约束。接下来,BTIGhidra通过使用一组证明规则来解决感兴趣的变量之间的所有可导出关系(即,类型常量(如 int 和 size_t 、函数和全局变量)。这些签名作为函数被调用时类型效果的总结。最后,BTIGhidra根据需要使用被调用SCC的签名来求解每个SCC的类型草图。

Type sketches are our representation of recursively constrained types. They represent a type as a directed graph, with edges labeled by fields that represent the capabilities of a type and nodes labeled by a bound [lb,ub]. Figure 10 shows an example of a type sketch for the value_dump function signature. As an example, the path from node 3 to 8 can be read as “the type with ID 3 is a function that has a second in parameter which is an atomic type that is a subtype of size_t and a supertype of bottom.” These sketches provide a convenient representation of types when lowering to C types through a fairly straightforward graph traversal. Type sketches also form a lattice with a join and meet operator defined by language intersection and union, respectively. These operations are useful for manipulating types while determining the most precise polymorphic type we can infer for each function in the binary. Join allows the algorithm to determine the least supertype of two sketches, and meet allows the algorithm to determine the greatest subtype of two sketches.
类型草图是递归约束类型的表示。它们将类型表示为有向图,边由表示类型功能的字段标记,节点由边界 [lb,ub] 标记。图10显示了 value_dump 函数签名的类型草图示例。作为示例,从节点3到8的路径可以被读取为“具有ID 3的类型是具有第二个in参数的函数,该第二个in参数是原子类型,该原子类型是 size_t 的子类型和bottom的超类型”。当通过相当直接的图遍历降低到C类型时,这些草图提供了一种方便的类型表示。类型草图还形成了一个网格,该网格具有分别由语言交集和并集定义的连接和相交运算符。这些操作对于操作类型很有用,同时确定我们可以为二进制文件中的每个函数推断出的最精确的多态类型。 Join 允许算法确定两个草图的最小超类型,并且meet允许算法确定两个草图的最大子类型。

Figure 10: Type sketch for the value_dump function signature
图10:value_dump函数签名的类型示意图

The importance of polymorphic type inference
多态类型推理的重要性

Using a type system that supports polymorphism may seem odd for inferring C types when C has no explicit support for polymorphism. However, polymorphism is critical for maintaining conservative types in the presence of C idioms, such as handling multiple types in a function by dispatching over a void pointer. Perhaps the most canonical examples of polymorphic functions in C are malloc and free.
当C没有显式支持多态性时,使用支持多态性的类型系统来推断C类型似乎很奇怪。然而,多态性对于在C语言中维护保守类型是至关重要的,比如通过空指针来处理函数中的多个类型。也许C中多态函数最典型的例子是 malloc 和 free 。

Figure 11: Example program that uses free polymorphically
图11:多态使用free的示例程序

In the example above, we consider a simple (albeit contrived) program that passes two structs to free. We access the fields of both foo and bar to reveal field information to the type inference algorithm. To demonstrate the importance of polymorphism, I modified the constraint generation phase of type inference to generate a single formal type variable for each function, rather than a type variable per call site. This change has the effect of unifying all constraints on free, regardless of the calling context.
在上面的例子中,我们考虑了一个简单的(尽管是人为的)程序,它传递了两个结构体给free。我们访问foo和bar的字段,以向类型推断算法揭示字段信息。为了证明多态性的重要性,我修改了类型推断的约束生成阶段,为每个函数生成一个形式类型变量,而不是为每个调用站点生成一个类型变量。此更改具有统一free上的所有约束的效果,而不管调用上下文如何。

The resulting unsound decompilation is as follows:
由此产生的不合理的反编译如下:

struct_for_node_0_13 * produce(struct_for_node_0_13
*param_1,struct_for_node_0_13 *param_2)

{
  param_1->field_at_0 = param_2->field_at_8;
  param_1->field_at_8 = param_2->field_at_0;
  param_1->field_at_16 = param_2->field_at_0;
  free(param_1);
  free(param_2);
  return param_1;
}

Figure 12: Unsound inferred type for the parameters to produce
图12:要生成的参数的不可靠推断类型

The assumption that function calls are non-polymorphic leads to inferring an over-precise type for the function’s parameters (shown in figure 12), causing both parameters to have the same type with three fields.
假设函数调用是非多态的,会导致为函数的参数推断出一个过于精确的类型(如图12所示),从而导致两个参数都具有相同的类型和三个字段。

Instead of unifying all call sites of a function, BTIGhidra generates a type variable per call site and unifies the actual parameter type with the formal parameter type only if the inferred type is structurally equal after a refinement pass. This conservative assumption allows BTIGhidra to remain sound and derive the two separate types for the parameters to the function in figure 11:
BTIGhidra不是统一函数的所有调用点,而是为每个调用点生成一个类型变量,并仅当推断的类型在细化后在结构上相等时才统一实际参数类型和形式参数类型。这种保守的假设允许BTIGhidra保持合理,并为图11中的函数的参数导出两个单独的类型:

struct_for_node_0_16 * produce(struct_for_node_0_16
*param_1,struct_for_node_0_20 *param_2)

{
  param_1->field_at_0 = param_2->field_at_8;
  param_1->field_at_8 = param_2->field_at_0;
  param_1->field_at_16 = param_2->field_at_0;
  free(param_1);
  free(param_2);
  return param_1;
} 

Evaluating BTIGhidra 评估BTIGhidra

Inter-procedural type inference on binaries operates over a vast set of information collected on the target program. Each analysis involved is a hard computational problem in its own right. Ghidra and our flow-sensitive analyses use heuristics related to control flow, ABI information, and other constructs. These heuristics can lead to incorrect type constraints, which can have wide-ranging effects when propagated.
对二进制文件的过程间类型推断操作于在目标程序上收集的大量信息。所涉及的每个分析本身就是一个很难的计算问题。Ghidra和我们的流敏感分析使用与控制流、ABI信息和其他结构相关的逻辑。这些语法可能导致不正确的类型约束,这在传播时可能产生广泛的影响。

Mitigating these issues requires a strong testing and validation strategy. In addition to BTIGhidra itself, we also released BTIEval, a tool for evaluating the precision of type inference on binaries with known ground-truth types. BTIEval takes a binary with debug information and compares the types recovered by BTIGhidra to those in the debug information (the debug info is ignored during type inference). The evaluation utility aggregates soundness and precision metrics. Utilizing BTIEval more heavily and over more test binaries will help us provide better correctness guarantees to users. BTIEval also collects timing information, allowing us to evaluate the performance impacts of changes.
缓解这些问题需要强大的测试和验证策略。除了BTIGhidra本身之外,我们还发布了BTIEval,这是一个用于评估具有已知真实类型的二进制文件的类型推断精度的工具。BTIEval接受带有调试信息的二进制文件,并将BTIGhidra恢复的类型与调试信息中的类型进行比较(在类型推断过程中忽略调试信息)。评估实用程序汇总了可靠性和精确性度量。更多地使用BTIEval和更多的测试二进制文件将有助于我们为用户提供更好的正确性保证。BTIEval还收集时间信息,使我们能够评估更改对性能的影响。

Give BTIGhidra a try
给给予BTIGhidra一个尝试

The pre-built Ghidra plugin is located here or can be built from the source. The walkthrough instructions are helpful for learning how to run the analysis and update it with new type signatures. We look forward to getting feedback on the tool and welcome any contributions!
预构建的Ghidra插件位于此处,也可以从源代码构建。演练说明有助于学习如何运行分析并使用新的类型签名对其进行更新。我们期待着得到反馈的工具,并欢迎任何贡献!

Acknowledgments 致谢

BTIGhidra’s underlying type inference algorithm was inspired by and is based on an algorithm proposed by Noonan et al. The methods described in the paper are patented under process patent US10423397B2 held by GrammaTech, Inc. Any opinions, findings, conclusions, or recommendations expressed in this blog post are those of the author(s) and do not necessarily reflect the views of GrammaTech, Inc.
BTIGhidra的底层类型推理算法受到努南等人提出的算法的启发,并基于该算法。本文中描述的方法已根据GrammaTech,Inc.持有的工艺专利US 10423397 B2获得专利。任何意见,调查结果,结论,或在这篇博客文章中表达的建议是作者(S),并不一定反映GrammaTech,Inc.的观点。

We would also like to thank the team at FKIE-CAD behind CWE Checker. Their static analysis platform over Ghidra PCode provided an excellent base set of capabilities in our analysis.
我们也要感谢CWE背后的FKIE-CAD团队。他们基于Ghidra PCode的静态分析平台为我们的分析提供了一套出色的基本功能。

This research was conducted by Trail of Bits based upon work supported by DARPA under Contract No. HR001121C0111 (Distribution Statement A, Approved for Public Release: Distribution Unlimited). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Government or DARPA.
这项研究是由Trail of Bits根据DARPA支持的工作进行的,合同号为HR 001121 C 0111(分发声明A,批准公开发布:无限制分发)。本材料中表达的任何意见、发现和结论或建议均为作者的观点,不一定反映美国政府或DARPA的观点。

1Instructions for how to use the plugin to reproduce this demo are available here.
1 如何使用插件重现此演示的说明可在此处获得。

2A strongly connected component of a graph is a set of nodes in a directed graph where there exists a path from each node in the set to every other node in the set. Conceptually an SCC of functions separates the call graphs into groups of functions that do not recursively call each other.
2 图的强连通分支是有向图中的节点集合,其中存在从集合中的每个节点到集合中的每个其他节点的路径。从概念上讲,函数的SCC将调用图分隔为不递归调用彼此的函数组。

原文始发于Ian Smith:Binary type inference in Ghidra

版权声明:admin 发表于 2024年2月8日 上午8:59。
转载请注明:Binary type inference in Ghidra | CTF导航

相关文章