GPT VS MALWARE ANALYSIS: CHALLENGES AND MITIGATIONS

逆向病毒分析 12个月前 admin

69 0 0

Key Takeaways 关键要点

We identify, and give concrete examples of, natural strengths and typical challenges OpenAI’s GPT-4 (henceforth ‘GPT’) faces when reasoning in the domain of malware analysis.
我们识别并举例说明OpenAI的GPT-4（以下简称“GPT”）在恶意软件分析领域进行推理时面临的自然优势和典型挑战。
We drill down into the source and the shape of the ‘ceiling’ hampering the application of GPT to several malware analyst tasks.
我们深入研究了阻碍 GPT 应用于多个恶意软件分析师任务的来源和“天花板”的形状。
We introduce hacks and mitigations to overcome this ‘ceiling’ and expand GPT’s ability to reason in this domain.
我们引入了黑客和缓解措施来克服这个“天花板”，并扩展 GPT 在该领域的推理能力。
As a proof of concept, we show a heavily engineered prompt that improves GPT’s ability to correctly guide an analyst performing triage on the tested binary samples.
作为概念验证，我们展示了一个经过大量设计的提示，该提示提高了 GPT 正确指导分析师对测试的二进制样本进行分类的能力。

Introduction 介绍

GPT technology is the current tech cycle’s veritable miracle. The skeptics insist that it just has the appearance of intelligence, and try to cast it as ‘just the latest buzzword’, making snide comparisons to NFTs and blockchains. But from intimate experience, we can say these comparisons are deeply unfair — we’ve had projects that should have taken a week be done in a few hours with GPT’s assistance, a feat that no NFT has ever achieved that we’re aware of. On the other side of the fence, one does sometimes encounter “AI mania”, which views GPT as essentially omnipotent magic that can be applied to unsolved problems to render them solved, and is amazed that the great AI uprising has somehow not happened yet.
GPT 技术是当前技术周期名副其实的奇迹。怀疑论者坚持认为它只是具有智能的外观，并试图将其塑造成“最新的流行语”，与 NFT 和区块链进行冷嘲热讽的比较。但从亲密的经验来看，我们可以说这些比较是非常不公平的——在 GPT 的帮助下，我们本应花一周时间完成的项目在几个小时内完成，这是我们所知道的 NFT 从未实现过的壮举。在篱笆的另一边，人们有时确实会遇到“人工智能狂热”，它认为 GPT 本质上是无所不能的魔法，可以应用于未解决的问题以解决这些问题，并且惊讶于伟大的 AI 起义不知何故还没有发生。

GPT VS MALWARE ANALYSIS: CHALLENGES AND MITIGATIONS — Figure 1 – We’re not quite there yet. (Credit: SMBC comics)
图 1 – 我们还没有完全实现。（图片来源：SMBC漫画）

The truth about GPT technology and its capabilities is more complex and nuanced than any of these simplified views, and we figured that the best way to get acquainted with it would be via a practical tour. We pitted OpenAI’s GPT-4 (henceforth “GPT”) against the daunting, sometimes unfair, task of dealing with a piece of malware. The first task was triage: given a binary, identify it and deliver a benign or malicious verdict. As it turned out, seeing GPT deal with this apparently simple task already produced a wealth of insight regarding its ability to reason in this domain. This text focuses on the insights from observing GPT grappling with the triage task, though highlights from the other tasks do occasionally appear. Quotes are verbatim, except for some faithful reconstructions which are marked with an asterisk.
关于 GPT 技术及其功能的真相比任何这些简化视图都更加复杂和微妙，我们认为熟悉它的最佳方式是通过实际游览。我们将 OpenAI 的 GPT-4（以下简称“GPT”）与处理恶意软件的艰巨、有时甚至是不公平的任务进行了对比。第一项任务是分类：给定一个二进制文件，识别它并提供良性或恶意的判决。事实证明，看到 GPT 处理这个看似简单的任务已经产生了关于它在该领域推理能力的丰富见解。本文重点介绍观察 GPT 处理分类任务的见解，尽管偶尔会出现其他任务的亮点。引号是逐字逐句的，除了一些用星号标记的忠实重建。

It is important to understand that the below is not an objective, ground-up technical analysis of the capabilities of GPT. Rather, it is a summary of our impressions when interacting with GPT and repeatedly observing how it responds to the specific tasks we had handed it. These tests were conducted using GPT-4 versions that were deployed for use in ChatGPT during July and August of 2023.
重要的是要明白，以下内容并不是对 GPT 能力的客观、自下而上的技术分析。相反，它是我们在与 GPT 交互时的印象总结，并反复观察它如何响应我们交给它的特定任务。这些测试是使用 GPT-4 版本进行的，这些版本在 2023 年 7 月和 8 月期间部署用于 ChatGPT。

The Triage Task 会审任务

We posited the triage task to GPT in the following manner:
我们按以下方式将分类任务交给 GPT：

You play the role of a security analyst who has just been handed a VirusTotal report for an executable file. Highlights of the report are given below. Make note of them and confirm you have looked through these report highlights.
您扮演的安全分析师的角色是刚刚收到可执行文件的 VirusTotal 报告。报告的要点如下。记下它们，并确认您已浏览这些报告要点。

We asked GPT to deliver its conclusions and verdict on 2 reports:
我们要求 GPT 对 2 份报告发表结论和结论：

fb136c8360d1a5ab80f61109c55c5a788aa1d8796d1e75aca8c1a762b598d3f4 (GandCrab Ransomware)
fb136c8360d1a5ab80f61109c55c5a788aa1d8796d1e75aca8c1a762b598d3f4 （ GandCrab 勒索软件）
fc45eb5c9d3f89cb059212e00512ec0e6c47c1bdf12842256ceda5d4f1371bd5 (APSDaemon.exe, Apple push executable, benign in itself, used for sideloading malicious DLLs, e.g., in Coinloader infections)
fc45eb5c9d3f89cb059212e00512ec0e6c47c1bdf12842256ceda5d4f1371bd5（APSDaemon.exe，苹果推送可执行文件，本身是良性的，用于旁加载恶意DLL，例如，在Coinloader感染中）

Below we list how GPT fared when dealing with this task, and some of the others — where it displayed its natural strengths and where we ran into challenges that had to be creatively worked around, with in-line examples where warranted. In addition to examples straight from the malware tasks, we’ve added some examples from other prompts and tasks that we have tested GPT with, that we feel demonstrate some points particularly well.
下面我们列出了 GPT 在处理这项任务时的表现，以及其他一些任务——它在哪些方面展示了它的天然优势，以及我们在哪些方面遇到了必须创造性地解决的挑战，并在必要时提供了内联示例。除了直接来自恶意软件任务的示例外，我们还添加了一些来自其他提示和任务的示例，我们测试了 GPT，我们认为这些示例特别很好地展示了一些要点。

GPT’s Natural Strengths GPT 的天然优势

Sentence Completion 句子完成

GPT is a totally verbal thinker. Its entire power is predicated on an outstanding capability to decide what’s the most appropriate word to put, and where, in its response. This is one of the most important things to understand about GPT — a lot of the behavior that we will cover later is, in a sense, downstream from this one property.
GPT 是一个完全口头思考的人。它的全部力量都建立在一种出色的能力之上，即决定在回应中放置什么词最合适，以及放在什么位置。这是了解 GPT 最重要的事情之一——从某种意义上说，我们稍后将介绍的许多行为都是这个属性的下游。

One of the immediate implications of this is that GPT has access to a huge latent cheat sheet. If someone, at any point in history, has answered the actual question being asked and this answer has made it into GPT’s training data, GPT exhibits an uncanny ability to reproduce the answer. For example, in response to the GandCrab report, GPT correctly recalls that:
这样做的直接影响之一是 GPT 可以访问一个巨大的潜在备忘单。如果有人在历史上的任何时候回答了被问到的实际问题，并且这个答案已经进入了 GPT 的训练数据，那么 GPT 就会表现出一种不可思议的能力来重现答案。例如，在回应 GandCrab 报告时，GPT 正确地回忆起：

The executable runs the “wmic.exe” with the “shadowcopy delete” command, which is often used by ransomware to delete Volume Shadow Copies. Volume Shadow Copy is a service in Windows that automatically creates backup copies or snapshots of computer files or volumes, even when they are in use. Deleting shadow copies is a common tactic used by ransomware to prevent victims from recovering encrypted files without paying the ransom.
可执行文件使用“shadowcopy delete”命令运行“wmic.exe”，勒索软件通常使用该命令来删除卷影副本。卷影复制是 Windows 中的一项服务，可自动创建计算机文件或卷的备份副本或快照，即使它们正在使用中也是如此。删除卷影副本是勒索软件常用的一种策略，用于防止受害者在不支付赎金的情况下恢复加密文件。

It is important to note that in some situations, GPT will fail to recall an answer that looks like it “should” be on this cheat sheet. An example of this, taken from outside the realm of malware analysis, can be seen below:
需要注意的是，在某些情况下，GPT 将无法回忆起看起来“应该”出现在此备忘单上的答案。从恶意软件分析领域之外获取的一个例子如下：

USER: Can you find an example of [a proof that P=NP implies EXP=NEXP] in the literature and cite it verbatim?
用户：你能在文献中找到一个[证明 P=NP 意味着 EXP=NEXP] 的例子并逐字引用吗？
GPT: [consults Google scholar] I apologize for the confusion earlier. Upon further reflection, I realize that I made a mistake in my explanation. The statement “If P = NP then EXP = NEXP” is not a standard result in complexity theory, and I am not aware of a proof or counterexample in the literature.
GPT：[咨询谷歌学者]我为之前的混乱道歉。经过进一步的思考，我意识到我的解释犯了一个错误。“如果 P = NP，则 EXP = NEXP”这句话不是复杂性理论中的标准结果，我不知道文献中的证明或反例。

The proof does exist, but GPT first ran a google scholar search and could not find the proof in the first page of results. Sentence completion had trouble recovering from this error. It’s easy to speculate how this happened: if one tries to complete the sentence: “Looking at a list of Google scholar search results for ‘if P=NP then EXP=NEXP’, I have found: None. Therefore…”, GPT’s response comes almost naturally.
证据确实存在，但 GPT 首先进行了谷歌学术搜索，在结果的第一页找不到证据。句子完成无法从此错误中恢复。很容易推测这是怎么发生的：如果有人试图完成这句话：“查看 Google 学术搜索结果列表，查找’if P=NP then EXP=NEXP’，我发现：无。因此……“，GPT 的反应几乎是自然而然的。

Big Picture Summaries 大图摘要

Through its web of word associations, GPT has a keen grasp of grammar and of the difference between key vs. ancillary facts. As a result, one of the tasks where GPT can be trusted to perform most reliably is producing “a summary of the big picture” when given input which is too large for comfort for human consumption. For example, when given part of a very lengthy API call log produced by a piece of malware and asked to summarize the log, GPT produced the below useful output:
通过其单词关联网络，GPT 对语法以及关键事实与辅助事实之间的区别有着敏锐的把握。因此，GPT 可以被信任执行最可靠的任务之一是在给定输入时生成“大局摘要”，该输入太大而无法满足人类消费。例如，当给定由恶意软件生成的非常长的 API 调用日志的一部分并要求对日志进行总结时，GPT 会生成以下有用的输出：

The malware seems to be heavily interacting with Windows API and doing various operations such as file operations, memory management, privilege escalation, loading libraries, and notably cryptography-related operations.
该恶意软件似乎与 Windows API 进行了大量交互，并执行各种操作，例如文件操作、内存管理、权限提升、加载库，尤其是与加密相关的操作。

A few notable points:
几点值得注意的：

File Operations: The malware interacts with various file-related functions such as CreateFileA, CreateFileW, ReadFile, SetFilePointer, FindFirstFileW, and FindNextFileW. These might be used to read/write/modify files on the infected system.
文件操作：恶意软件与各种与文件相关的函数进行交互，例如 CreateFileA、CreateFileW、ReadFile、SetFilePointer、FindFirstFileW 和 FindNextFileW。这些可能用于读取/写入/修改受感染系统上的文件。

Loading Libraries: Various DLL libraries such as kernel32.dll, msvcrt.dll, ntdll.dll, bcrypt.dll, user32.dll, and advapi32.dll are being loaded. These libraries contain functions that enable the malware to interact with the system.
加载库：正在加载各种 DLL 库，例如 kernel32.dll、msvcrt.dll、ntdll.dll、bcrypt.dll、user32.dll 和 advapi32.dll。这些库包含使恶意软件能够与系统交互的功能。

[..and so on..] [..等等..]

Figure 2 – GPT does well when asked for a bird’s eye view.
图 2 – GPT 在被要求鸟瞰时表现良好。

We later discuss one of the main challenges GPT runs into when performing this task, and how to overcome that limitation (see below under ‘Memory Window Drift’).
我们稍后将讨论 GPT 在执行此任务时遇到的主要挑战之一，以及如何克服该限制（请参阅下文“内存窗口漂移”）。

Logical Reasoning 逻辑推理

GPT can leverage its underlying capacity of sentence completion to perform surprising feats of logical reasoning. Consider the following prompt.
GPT 可以利用其潜在的句子完成能力来执行令人惊讶的逻辑推理壮举。请考虑以下提示。

previous state of working memory:
以前的工作记忆状态：

BEGIN WORKING MEMORY 开始工作记忆

You are computing a function f on an input X. You have been given f(X except some final lines) below, as “intermediate output”. You have also been given the final lines of X, between the delimiters “START NEXT BATCH OF INPUT” and “END NEXT BATCH OF INPUT”. When computing the new state of working memory, copy the previous working memory verbatim, starting with “BEGIN WORKING MEMORY” and ending with “END WORKING MEMORY.” Change only working memory that appears under “intermediate output”, by replacing it with your computed value of f(X), which cannot exceed a 1,000 words.
您正在计算输入 X 上的函数 f。下面给你的 f（X 除了最后几行）作为“中间输出”。您还获得了 X 的最后一行，位于分隔符“START NEXT BATCH OF INPUT”和“END NEXT BATCH OF INPUT”之间。计算工作记忆的新状态时，逐字复制以前的工作记忆，从“BEGIN WORKING MEMORY”开始，到“END WORKING MEMORY”结束。仅更改显示在“中间输出”下的工作内存，将其替换为计算值 f（X），该值不能超过 1,000 个单词。

The function: f(X) = “sum”
函数：f（X） = “sum”

Intermediate output: 2 中间输出：2

END WORKING MEMORY 结束工作记忆

START NEXT BATCH OF INPUT
开始下一批输入

Eleven. 十一。

END NEXT BATCH OF INPUT
结束下一批输入

Please output, with no introductions, addendums or ceremony, the next state of the working memory after processing the above lines of input.
请在处理上述输入行后输出工作记忆的下一个状态，不带介绍、附录或仪式。

Plenty of humans would fail to process this request, but GPT responds correctly:
很多人都无法处理这个请求，但 GPT 会正确响应：

BEGIN WORKING MEMORY 开始工作记忆

You are computing a function f on an input X. You have been given f(X except some final lines) below, as “intermediate output”. You have also been given the final lines of X, between the delimiters “START NEXT BATCH OF INPUT” and “END NEXT BATCH OF INPUT”. When computing the new state of working memory, copy the previous working memory verbatim, starting with “BEGIN WORKING MEMORY” and ending with “END WORKING MEMORY.” Change only working memory that appears under “intermediate output”, by replacing it with your computed value of f(X), which cannot exceed a 1,000 words.
您正在计算输入 X 上的函数 f。下面给你的 f（X 除了最后几行）作为“中间输出”。您还获得了 X 的最后一行，位于分隔符“START NEXT BATCH OF INPUT”和“END NEXT BATCH OF INPUT”之间。计算工作记忆的新状态时，逐字复制以前的工作记忆，从“BEGIN WORKING MEMORY”开始，到“END WORKING MEMORY”结束。仅更改显示在“中间输出”下的工作内存，将其替换为计算值 f（X），该值不能超过 1,000 个单词。

The function: f(X) = “sum”
函数：f（X） = “sum”

Intermediate output: 13 中间输出：13

END WORKING MEMORY 结束工作记忆

This ability should be used with care. If one overburdens GPT with too many arcane requirements and verbose conditions, it has been our experience that this magical ability begins to break down: some requirements get misunderstood, and others outright forgotten (see below, under ‘Logical Reasoning Ceiling’).
应谨慎使用此功能。如果一个人用太多晦涩难懂的要求和冗长的条件来使 GPT 负担过重，那么根据我们的经验，这种神奇的能力开始崩溃：一些要求被误解，而另一些则被完全遗忘（见下文，在“逻辑推理上限”下）。

Keep this specific example in mind, as we will later use a very similar prompt as a bootstrap to overcome one of the main challenges in using GPT to process long inputs.
请记住这个具体示例，因为我们稍后将使用非常相似的提示作为引导程序，以克服使用 GPT 处理长输入的主要挑战之一。

Challenges in Practical GPT Application
GPT实际应用的挑战

While GPT is an artificial construct, many of the challenges one encounters when applying GPT to the domain of malware analysis seem strangely human on GPT’s part. We collected many examples of GPT running into these challenges while attempting to deal with some task, and tried as much as possible to sort them into larger, more general categories. The result was the below list of 6 general principal obstacles:
虽然 GPT 是一种人工结构，但在将 GPT 应用于恶意软件分析领域时遇到的许多挑战对 GPT 来说似乎很奇怪。我们收集了许多 GPT 在尝试处理某些任务时遇到这些挑战的例子，并尽可能地将它们分类为更大、更一般的类别。结果是以下 6 个一般主要障碍的列表：

Memory Window Drift 内存窗口漂移
Gap between Knowledge and Action
知识与行动之间的差距
Logical Reasoning Ceiling
逻辑推理上限
Detachment from Expertise
脱离专业知识
Goal Orientation 目标导向
Spatial Blindness 空间盲症

A lot of the individual examples are gestalt expressions of more than one category. Still, in the interest of readability, we decided to place examples in-line at specific sections where they belong particularly well. We give a detailed description of each of these categories below.
许多单独的例子是不止一个类别的格式塔表达。尽管如此，为了提高可读性，我们决定将示例放在它们特别好的特定部分。我们在下面对这些类别中的每一个进行详细说明。

Memory Window Drift 内存窗口漂移

GPT operates by breaking texts down into “tokens”. Some words are processed as exactly one token, and some are broken down into two or more; the specifics here aren’t crucial for the discussion we want to have. When answering prompts, GPT looks back at a bounded “window” containing a fixed number of recent tokens. The exact number varies with the model: standard GPT-4 has 8k, and OpenAI offers a separate model with a 32k window. Either way, this can quickly become a limiting factor when dealing with large input.
GPT 通过将文本分解为“标记”来运作。有些单词被处理为一个标记，有些单词被分解为两个或更多;这里的细节对于我们想要进行的讨论并不重要。在回答提示时，GPT 会回头查看一个有界的“窗口”，其中包含固定数量的最近令牌。确切的数字因型号而异：标准 GPT-4 有 8k，OpenAI 提供了一个单独的模型，窗口为 32k。无论哪种方式，在处理大量输入时，这都会很快成为限制因素。

One must take special care preparing for the moment where the original start of the conversation leaves the token window. The beginning of the conversation is probably where the task instructions were first laid out; once the window leaves these instructions behind, GPT’s ability to adhere to them becomes much more limited. From that point on, GPT is effectively going off a second-hand task description, implied by the prompts and responses still in the window. Once an aspect of the task is no longer implied by anything in the window, it is simply lost.
必须特别注意为对话的原始开始离开令牌窗口的那一刻做好准备。对话的开始可能是最初布置任务说明的地方;一旦窗口将这些指令抛在脑后，GPT 遵守它们的能力就会变得更加有限。从那时起，GPT 实际上正在摆脱二手任务描述，这由窗口中的提示和响应所暗示。一旦窗口中的任何内容不再暗示任务的某个方面，它就会丢失。

This is a very blunt and apparent stumbling block. We’ve had colleagues testify that they tried applying GPT to their favorite problem, only to run into this exact difficulty immediately. It is not surprising, then, that the exact same thing happened during our testing. We had GPT analyze a log of API calls produced by a program that had been instrumented via TinyTracer, resulting in the following dialogue, which we have redacted for brevity.
这是一个非常直率和明显的绊脚石。我们有同事作证说，他们试图将 GPT 应用于他们最喜欢的问题，但很快就遇到了这个确切的困难。因此，在我们的测试过程中发生完全相同的事情也就不足为奇了。我们让 GPT 分析了通过 TinyTracer 检测的程序生成的 API 调用日志，从而产生了以下对话，为简洁起见，我们对其进行了编辑。

USER: You are a malware analyst who has been given an API call log of a certain strain of malware. This log has been generated using the tool ‘tinytracer’. The first 100 or so lines of the log are below. Please read through the log and confirm that you have read through it.
用户：您是一名恶意软件分析师，已获得某种恶意软件的 API 调用日志。此日志是使用工具“tinytracer”生成的。日志的前 100 行左右如下。请通读日志并确认您已通读。

[100 lines of API call log, starting with: 8f4d;section: [.text] 968f;kernel32.HeapCreate a047;kernel32.GetModuleHandleW a069;kernel32.GetProcAddress GetProcAddress: Arg[0] = ptr 0x77100000 -> {MZ} Arg[1] = ptr 0x10013c24 -> “FlsAlloc” …]
【100行API调用记录，开头：8f4d;部分： [.text] 968f;内核32.堆创建 a047;内核32.GetModuleHandleW a069;内核32.GetProcAddress GetProcAddress： Arg[0] = ptr 0x77100000 -> {MZ} Arg[1] = ptr 0x10013c24 -> “FlsAlloc” …]

GPT: I have read through the provided API call log, which appears to show the activities of a suspected malware. Here’s a brief summary: [proceeds to adequately summarize the events in the log so far]
GPT：我已经阅读了提供的 API 调用日志，其中似乎显示了可疑恶意软件的活动。下面是一个简短的摘要：[继续充分总结到目前为止日志中的事件]

USER: [100 additional lines of API call log]
USER：[新增 100 行 API 调用日志]

GPT: [adequately summarizes the events in the 100 additional lines]
GPT：[在另外 100 行中充分总结了事件]

USER: [100 additional lines of API call log]
USER：[新增 100 行 API 调用日志]

GPT: It appears this log comes from a system monitoring tool or a dynamic binary analysis tool. It is listing the different Windows API calls made by a certain application or process…
GPT：此日志似乎来自系统监控工具或动态二进制分析工具。它列出了由某个应用程序或进程发出的不同 Windows API 调用……

From GPT’s point of view, it sees four hundred lines of API log lines interspersed with short summaries. To some extent it can infer what it is expected to do, but the original instructions in the first prompt have been lost.
从 GPT 的角度来看，它看到 400 行 API 日志行，其中穿插着简短的摘要。在某种程度上，它可以推断出它应该做什么，但第一个提示中的原始指令已经丢失。

Mitigation 缓解

You will be happy to know that memory window drift can be mitigated. That doesn’t mean the effect disappears entirely, but it becomes more similar to what you might encounter when conversing with a human. GPT will examine conversation still inside the window and respond to it in high fidelity, while keeping a stripped-down compressed memory of the past.
您会很高兴知道可以缓解内存窗口漂移。这并不意味着这种效果会完全消失，但它会变得更类似于您在与人类交谈时可能遇到的情况。GPT 将检查仍在窗口内的对话并以高保真度做出回应，同时保留过去的精简压缩记忆。

How is this magic trick pulled off? Recall that earlier, we saw how GPT-4 specifically can perform powerful feats of logical reasoning; specifically, for some function f, it can take f, f(X), Y as inputs and use them to compute f(X|Y). This is conceptually similar to what is known, in mathematics, as a ‘proof by induction’. You can make use of it yourself by copy-pasting the template below and tweaking it to your needs.
这个魔术是如何实现的？回想一下，早些时候，我们看到了 GPT-4 如何具体执行强大的逻辑推理壮举;具体来说，对于某些函数 f ，它可以将 f 、 f(X) 作为 Y 输入并使用它们来计算 f(X|Y) 。这在概念上类似于数学中所谓的“归纳证明”。您可以通过复制粘贴下面的模板并根据需要进行调整来自己使用它。

You’re computing f(X). You’re given f(X except the final few lines) (henceforth: “BASE”) and the final few lines of X (henceforth: “STEP”).
您正在计算 f（X）。你得到 f（X 除了最后几行）（以下简称：“BASE”）和 X 的最后几行（以下简称：“STEP”）。

f(X) is defined as: X is an initial segment of a play, starting with the first line. f(X) is a summary of X.
f（X）的定义是：X 是戏剧的初始片段，从第一行开始。f（X）是 X 的摘要。

BEGIN BASE 开始基地

In Verona, two noble families, the Montagues and the Capulets, have an old feud. The play will focus on a tragic love story between two young individuals from these families, whose deaths will ultimately reconcile the feuding households. In a public place in Verona, Sampson and Gregory, from the Capulet house, discuss their disdain for the Montagues. Their conversation involves playful banter about overcoming the Montagues in battle and affections toward women. The two prepare for a confrontation as members of the Montague house approach.
在维罗纳，两个贵族家族，蒙塔古家族和卡普莱特家族，有着古老的世仇。该剧将聚焦于来自这些家庭的两个年轻人之间的悲惨爱情故事，他们的死亡最终将使不和的家庭和解。在维罗纳的一个公共场所，来自凯普莱特家的桑普森和格雷戈里讨论了他们对蒙塔古家族的蔑视。他们的谈话涉及关于在战斗中战胜蒙塔古人的俏皮玩笑和对女性的感情。两人准备在蒙塔古家族成员接近时进行对抗。

END BASE 端基座

BEGIN STEP 开始步骤

ACT I 第一幕

SAMPSON 桑普森
My naked weapon is out: quarrel, I will back thee.
我赤裸裸的武器已经出来了：吵架，我会支持你。

[…etc, etc…] […等等，等等…]

BENVOLIO 本沃利奥
I do but keep the peace: put up thy sword,
我只保佑平安：举起你的剑，
Or manage it to part these men with me.
或者设法把这些人和我分开。

END STEP 结束步骤

Please output the EXACT VALUE of f(X) per the definition. Do not lead with any ars poetical context (“Considering the text, we must identify crucial plot points that are important to summarize…” or even “here’s the output:”) and do not conclude with any ars poetical context (“awaiting further input”). Do not mention the phrase “f(X)”.
请根据定义输出 f（X）的确切值。不要以任何诗歌语境开头（“考虑到文本，我们必须确定重要的情节点，需要总结……”甚至“这是输出：”），并且不要以任何 ARS 诗歌上下文（“等待进一步的输入”）结束。不要提及短语“f（X）”。_^*

The new BASE returned by GPT can be used in the next prompt, with the STEP set to the next part of the input, to obtain an even more updated BASE. The final output is the BASE after it has been sequentially updated with every STEP. You can use this method to process input of arbitrary length, limited only by your quota and by a well-chosen, well-worded f(X). We have used this method (in conjunction with other nudges) to get GPT to successfully complete the API log summarization task.
GPT 返回的新 BASE 可以在下一个提示中使用，并将 STEP 设置为输入的下一部分，以获得更新的 BASE。最终输出是 BASE，它在每个 STEP 中按顺序更新后。您可以使用此方法处理任意长度的输入，仅受配额和精心选择、措辞良好的 f(X) .我们使用此方法（结合其他微移）让 GPT 成功完成 API 日志汇总任务。

Two quick tips: 两个快速提示：

When processing the first part of the input, set the BASE to “none so far”.
处理输入的第一部分时，将 BASE 设置为“到目前为止没有”。
Sometimes a more difficult, more detailed f(X) ‘paradoxically’ results in an easier task for GPT because it makes the transition f(X) -> f(X|Y) easier (this is called strengthening the induction hypothesis).
有时，更困难、更详细的 f(X) “矛盾”会导致 GPT 更容易完成任务，因为它使过渡 f(X) -> f(X|Y) 更容易（这称为强化归纳假设）。

Gap Between Knowledge and Action
知识与行动之间的差距

Feynman famously complained about students who “had memorized everything, but they didn’t know what anything meant… [..] they didn’t know that [‘light reflected from a medium’] meant a material such as water, so if I say, ‘Look at the water’, nothing happens – they don’t have anything under ‘Look at the water’!”. Some of the particular obstacles we encountered when applying GPT to malware analysis tasks seem to be echoes of Feynman’s frustration.
费曼曾抱怨过一些学生，“他们已经记住了一切，但他们不知道任何事情意味着什么……[..]他们不知道[‘从介质反射的光’]是指水等物质，所以如果我说，’看水’，什么也没发生——他们在’看水’下什么都没有！我们在将 GPT 应用于恶意软件分析任务时遇到的一些特殊障碍似乎是费曼挫败感的回声。

In a 2023 paper, Berglund et al find that
在 2023 年的一篇论文中，Berglund 等人发现

If a model is trained on a sentence of the form “A is B”, it will not automatically generalize to the reverse direction “B is A”. This is the Reversal Curse. For instance, if a model is trained on “Olaf Scholz was the ninth Chancellor of Germany”, it will not automatically be able to answer the question, “Who was the ninth Chancellor of Germany?”.
如果一个模型是用“A is B”形式的句子训练的，它不会自动泛化到相反的方向“B is A”。这就是逆转诅咒。例如，如果一个模型是以“奥拉夫·舒尔茨是德国第九任总理”为主题进行训练，它将无法自动回答“谁是德国第九任总理”的问题。

Being able to complete a sentence is not enough; attention needs to be drawn to that knowledge, and the knowledge needs to be successfully integrated with the situation at hand. The way this principle can interfere with GPT’s work can be starkly demonstrated even without resorting to complex malware analysis tasks:
能够完成一个句子是不够的;需要注意这种知识，并且需要将这种知识成功地与当前的情况结合起来。即使不诉诸复杂的恶意软件分析任务，也可以清楚地证明这一原则可以干扰 GPT 工作的方式：

USER: I think I’ve lost my car keys, and I would like you to help me find them. This morning my wife was sick and she asked me to drive our 2-year-old to kindergarten. I took her with me, seated her in the car and drove off. While driving I got a really furious call from my boss about an urgent issue I thought I’d dealt with yesterday, but apparently there were still some loose ends to tie up. I stayed on the call with him all the while driving to work, getting out of the car, taking the stairs to the office and finally after an intense hour of work, resolving the issue. I’ve now noticed my car keys are missing. Where should I look for them?
用户：我想我的车钥匙丢了，我希望你能帮我找到它们。今天早上我的妻子生病了，她让我开车送我们 2 岁的孩子去幼儿园。我把她带走，让她坐在车里，然后开车走了。在开车的时候，我接到老板的一个非常愤怒的电话，说我昨天已经处理了一个紧急问题，但显然还有一些松散的问题需要解决。我一直和他通电话，开车去上班，下车，走楼梯去办公室，最后经过一个小时的紧张工作，解决了这个问题。我现在注意到我的车钥匙不见了。我应该在哪里寻找它们？

GPT: There are several places where you could have left your keys. First, you should probably search the office table and other surfaces where you could have easily deposited the keys as you entered the office in a hurry. If the keys are not there, you might want to retrace your steps to the parking lot and see whether the keys were perhaps lost when you hurried up to your office. Finally, you might want to contact reception in case your keys have been found by someone else and handed to security already.
GPT：有几个地方你可以留下钥匙。首先，您可能应该搜索办公室桌子和其他表面，以便在匆忙进入办公室时轻松存放钥匙。如果钥匙不在那里，你可能想回溯你的脚步到停车场，看看当你匆匆赶到办公室时，钥匙是否丢失了。最后，您可能需要联系接待处，以防您的钥匙被其他人发现并交给保安。

USER: Look at what I said again and carefully!
用户：仔细再看我说的话！

GPT: Oh. On second look, you seem to have forgotten your 2-year-old in the car… [..] ^*
GPT：哦。再看一眼，你似乎把你 2 岁的孩子忘在了车里……[..] ^*

Problem-solving is a deep process involving a myriad implicit, unconscious questions: Am I focusing on the most important thing? Is what I am trying to do even possible? (One might even add: “Am I expressing myself with sufficient tact and empathy?”). One obstacle when applying GPT to a task is accidentally prodding it into a state where this process is not fully engaged.
解决问题是一个深层次的过程，涉及无数隐含的、无意识的问题：我是否专注于最重要的事情？我想要做的事情可能吗？（甚至可以补充说：“我是否以足够的机智和同理心来表达自己？将 GPT 应用于任务时的一个障碍是意外地将其推入此过程未完全参与的状态。

Self-awareness is a failsafe. To appreciate its function, you need to encounter some behavior that the failsafe should have caught. For example, consider this interaction from the Gandcrab triage test:
自我意识是一种故障保险。要了解它的功能，您需要遇到故障保护应该捕获的一些行为。例如，考虑 Gandcrab 分类测试中的这种交互：

GPT: [To identify this malware] first I would check the hash of the file (MD5, SHA-1, and SHA-256 provided in the report) in various threat intelligence platforms, like AlienVault’s Open Threat Exchange (OTX), IBM’s X-Force Exchange, and VirusTotal’s own database. This is usually the fastest and most precise way to identify a known strain of malware, as it relies on previous analysis and classification by other security researchers.
GPT：[为了识别这种恶意软件]，我首先会检查各种威胁情报平台（如 AlienVault 的 Open Threat Exchange （OTX）、IBM 的 X-Force Exchange 和 VirusTotal 自己的数据库）中文件的哈希值（报告中提供的 MD5、SHA-1 和 SHA-256）。这通常是识别已知恶意软件的最快、最精确的方法，因为它依赖于其他安全研究人员之前的分析和分类。

USER: You find no mentions of this specific hash outside of the VT report itself.
用户：在 VT 报告本身之外，您没有发现任何提及此特定哈希值的内容。

GPT: [..] The executable contacts several URLs and writes to specific directories, notably the recycle bin. The command “C:\vssadmin.exe” shadowcopy delete is particularly interesting as it suggests that the ransomware attempts to delete shadow copies, which is a common behavior of many ransomware strains to prevent data recovery. However, the exact URLs and directories might give us a clue about the specific ransomware strain we’re dealing with. Could you perform a search for ransomware associated with the observed behavior and contacted URLs?
GPT：[..]可执行文件联系多个 URL 并写入特定目录，特别是回收站。命令“C：\vssadmin.exe”卷影副本删除特别有趣，因为它表明勒索软件尝试删除卷影副本，这是许多勒索软件为阻止数据恢复而采取的常见行为。但是，确切的 URL 和目录可能会为我们提供有关我们正在处理的特定勒索软件菌株的线索。您能否搜索与观察到的行为和联系的 URL 相关的勒索软件？

Several of the other difficulties of properly applying GPT to its full potential are apparent here, and we discuss them below at greater length. But note that we can ask GPT “how relevant is deleting shadow copies for identifying a specific ransomware strain”, and get an exemplary answer. If it weren’t for the gap between knowledge and action, none of these other difficulties would manifest to begin with.
正确应用 GPT 充分发挥其潜力的其他几个困难在这里是显而易见的，我们将在下面更详细地讨论它们。但请注意，我们可以询问 GPT“删除卷影副本与识别特定勒索软件菌株的相关性如何”，并得到一个示例性的答案。如果不是知识和行动之间的差距，这些其他困难都不会从一开始就表现出来。

Mitigation 缓解

Adding the following bit to GPT’s prompt resulted in an improvement in GPT’s ability to bridge the gap:
在 GPT 的提示符中添加以下位可以提高 GPT 弥合差距的能力：

Focus on your ‘killer instinct’ and intellectual autonomy. Call out absurdities! Pay attention to incongruities and opportunities!
专注于你的“杀手本能”和智力自主。大声疾呼荒谬！注意不协调和机会！

Another approach we have had success with is feeding GPT’s own responses to itself and asking to look at them critically. In some cases, this has a surprisingly powerful effect, as can be seen in the above ‘car keys’ example.
我们成功的另一种方法是将 GPT 自己的反应提供给自己，并要求批判性地看待它们。在某些情况下，这会产生令人惊讶的强大效果，如上面的“车钥匙”示例所示。

In our experience, the most reliable way to mitigate this issue has been to engineer the task to be directly tractable such that GPT does not need to fall back on “trapped knowledge”.
根据我们的经验，缓解这个问题的最可靠方法是将任务设计为可直接处理，这样 GPT 就不需要依靠“陷阱知识”。

Logical Reasoning Ceiling
逻辑推理上限

We noted earlier that GPT has surprising powers of logical reasoning, and also noted in passing that this miracle has its limits. Here we will delve into this observation in more detail.
我们之前提到过，GPT 具有令人惊讶的逻辑推理能力，并且顺便指出，这个奇迹有其局限性。在这里，我们将更详细地研究这一观察结果。

During our repeated attempts to effectively apply GPT to the malware analysis tasks, we discovered that it’s too easy to overstep and overdraw on GPT’s logical reasoning capacity. For instance, a lot of trial and error went into constructing the base-step workaround for memory window drift in a way that is a good match for GPT’s processing ability. Even straining it moderately would sometimes introduce minor abberations to the output, as with the below example, where we asked for 10 tuples and got 11:
在我们反复尝试将 GPT 有效地应用于恶意软件分析任务的过程中，我们发现 GPT 的逻辑推理能力太容易越界和透支了。例如，为了构建内存窗口漂移的基本步骤解决方法，我们进行了大量的试验和错误，以与 GPT 的处理能力非常匹配。即使适度地拉紧它，有时也会给输出带来轻微的畸变，就像下面的例子一样，我们要求 10 个元组并得到 11 个元组：

USER: Given the below log of API calls, direct the attention of an analyst to the most interesting calls. Your output should be 10 API calls, in the form of a list of tuples: (address, API, justification). [The log follows]
用户：鉴于下面的 API 调用日志，将分析师的注意力引导到最有趣的调用上。输出应为 10 个 API 调用，采用元组列表的形式：（address， API， justification）。[日志如下]

GPT: [(1ebc03, ‘kernel32.GetProcAddress’, ‘Dynamic API resolution’), (1ebc0a, ‘kernel32.GetVersionEx’, ‘Get OS version’), (1ee934, ‘kernel32.GetProcAddress’, ‘Dynamic API resolution’), (db0a1, ‘kernel32.LoadLibraryA’, ‘Loading a suspicious DLL’), (9a9a, ‘kernel32.GetModuleHandleA’, ‘Accessing a suspicious module handle’), (1b934, ‘kernel32.LoadLibraryW’, ‘Loading a suspicious DLL’), (1b934, ‘kernel32.LoadLibraryW’, ‘Loading a different suspicious DLL’), (27909, ‘bcrypt.BCryptOpenAlgorithmProvider’, ‘Accessing a cryptographic service’), (141cd000+2c4, ‘kernel32.LoadLibraryW’, ‘Loading a suspicious DLL’), (141ce000+d87, ‘advapi32.OpenProcessToken’, ‘Accessing process token’), (14287000+e47, ‘bcrypt.BCryptOpenAlgorithmProvider’, ‘Accessing a cryptographic service’)] ^*
GPT：[（1ebc03，’kernel32。GetProcAddress’， ‘动态 API 解析’），（1ebc0a， ‘kernel32.GetVersionEx’， ‘获取操作系统版本’），（1ee934， ‘kernel32.GetProcAddress’， ‘动态 API 解析’），（db0a1， ‘kernel32.LoadLibraryA’， ‘加载可疑 DLL’），（9a9a， ‘kernel32.GetModuleHandleA’， ‘访问可疑模块句柄’），（1b934， ‘kernel32.LoadLibraryW’， ‘加载可疑 DLL’），（1b934， ‘kernel32.LoadLibraryW’， ‘加载不同的可疑 DLL’），（27909， ‘bcrypt.BCryptOpenAlgorithmProvider’， ‘访问加密服务’），（141cd000+2c4， ‘kernel32.LoadLibraryW’， ‘加载可疑 DLL’），（141ce000+d87， ‘advapi32.OpenProcessToken’， ‘访问进程令牌’），（14287000+e47， ‘bcrypt.BCryptOpenAlgorithmProvider’， ‘访问加密服务’）] ^*

Grappling with this puzzle led us to three best practices that helped us extract more value from GPT. First, we discovered that demanding a single ‘right answer’ out of hundreds of candidates based on some rigid criteria was counter-productive. Requesting a list of “top candidates” instead was better, and enabled GPT to succeed much more consistently by releasing it from the strain of having to get an implicit comparison right every single time. Second, we discovered that terse instructions are better: exactly stating a request using one word is better than two, which is better than three, which is better than using a full sentence. e.g. when asking GPT to summarize an API log without repeating calls, simply changing “10 API calls” to “10 unique API calls” fared much better than adding “do not output the same API call twice” and other variations on the same idea. GPT best handles instructions that are precise; the less room left of interpretation, the better.
解决这个难题后，我们找到了三个最佳实践，帮助我们从 GPT 中获取更多价值。首先，我们发现，根据一些严格的标准，从数百名候选人中选出一个“正确答案”会适得其反。相反，要求提供“最佳候选人”列表会更好，并且通过将其从每次都必须正确进行隐式比较的压力中解放出来，使 GPT 能够更一致地取得成功。其次，我们发现简洁的指令更好：用一个词准确地陈述一个请求比用两个词好，用三个词比用一个完整的句子好。例如，当要求 GPT 在不重复调用的情况下总结 API 日志时，简单地将“10 个 API 调用”更改为“10 个唯一的 API 调用”比添加“不要两次输出相同的 API 调用”和同一想法的其他变体要好得多。GPT 最能处理精确的指令;留给解释的空间越小越好。

Third and finally, we found that not all logical reasoning was created equal. GPT can reason “if X then Y; X, therefore Y”, perform addition, and apply itself to tasks typical of people’s everyday experience — but the more removed from everyday life the logical reasoning, the more difficulty it has. At the edge of its capacity it can, sometimes, deal with what we’ll call “exotic” reasoning — where practical examples come from the undergraduate sciences, rather than everyday life: parabolas, matrices, recurrence relations and whatnot. To explore the degree to which it’s not optimal to stretch GPT’s reasoning towards this limit, we handed GPT some of these undergraduate level problems (as per the below comic).
第三，也是最后一点，我们发现并非所有的逻辑推理都是平等的。GPT 可以推理“如果 X 那么 Y;X，因此是Y“，执行加法，并应用于人们日常经验的典型任务——但逻辑推理越远离日常生活，它就越困难。在它的能力边缘，它有时可以处理我们称之为“奇异”推理的东西——其中的实际例子来自本科科学，而不是日常生活：抛物线、矩阵、递归关系等等。为了探索将 GPT 的推理延伸到这个极限的程度不是最佳的，我们交给了 GPT 一些本科水平的问题（如下图所示）。

GPT was able to correctly answer the following questions:
GPT 能够正确回答以下问题：

Two metal balls with masses m1, m2 are suspended above ground at heights h1, h2. At time t=0 both masses are simultaneously released and begin falling down due to Earth’s gravity. Given: m1 = 4m2, h1 = 2h2. Which of the two metal balls will reach the ground first?
两个质量为 m1、m2 的金属球悬挂在地面上方，高度为 h1、h2。在时间 t=0 时，由于地球的引力，两个质量同时释放并开始下降。鉴于：m1 = 4m2，h1 = 2h2。两个金属球中的哪一个会先到达地面？

A metal ball with mass m is rolling towards an ascending vertical loop with a total height of h=2r (r is the loop radius). What is the minimum initial velocity that will allow the ball to clear the entire vertical loop?
一个质量为 m 的金属球向一个上升的垂直环滚动，总高度为 h=2r（r 是环半径）。允许球清除整个垂直循环的最小初始速度是多少？

Given the recurrence relation a_(n+1) = (2*a_n)+1 and a_0=1, what is the value of a_20?
给定递归关系 a_（n+1） = （2*a_n）+1 和 a_0=1，a_20的值是多少？

In a toy RSA system with a modulus n=15 and e=3, some input number has been encrypted, and the resulting ciphertext is the number 13. What was the original input number?
在模数为 n=15 和 e=3 的玩具 RSA 系统中，一些输入数字已被加密，得到的密文是数字 13。原始输入编号是多少？

But the following problems over-extended GPT’s logical reasoning capacity:
但以下问题过度扩展了 GPT 的逻辑推理能力：

USER: There are three switches and three lamps.
用户：有三个开关和三个灯。

The first switch toggles the state of the first and second lamp.
第一个开关切换第一个和第二个灯的状态。

The second switch toggles the state of the second and third lamp.
第二个开关切换第二个和第三个灯的状态。

The third switch toggles the state of the first and third lamp.
第三个开关切换第一盏和第三盏灯的状态。

Initially, all lamps are off. Is it possible to flip the switches into a configuration where all lamps will be on? If so, what switches need to be flipped? If not, why not?
最初，所有灯都熄灭。是否可以将开关翻转到所有灯都亮起的配置？如果是这样，需要翻转哪些开关？如果不是，为什么不呢？

GPT: [Posits the correct system of equations modulo 2] the solution to the system of equations is x1=1/2, x2=1/2, and x3=1/2. However [..] the switches can only be in an on (1) or off (0) state. They cannot be in a half-on, half-off state (1/2). Therefore, it is not possible to flip the switches into a configuration where all lamps will be on.
GPT：[假设正确的方程组模 2] 方程组的解是 x1=1/2、x2=1/2 和 x3=1/2。但是，[..] 开关只能处于开（1）或关（0）状态。它们不能处于半开半关状态（1/2）。因此，不可能将开关翻转到所有灯都亮起的配置。

USER: True or false: The product of two simultaneously diagonalizable positive-definite matrices is also positive-definite. If true, give a proof. If false, give a counterexample.
用户：对或错：两个同时可对角化的正定矩阵的乘积也是正定的。如果属实，请提供证据。如果为 false，请举一个反例。

GPT: Here is a counterexample: Consider two 2×2 matrices A and B, A = [[2, 0], [0, 1]] B = [[1, 0], [0, 2]]. The product AB is: AB = [[2, 0], [0, 2]] which is not positive-definite [..]
GPT：这是一个反例：考虑两个 2×2 矩阵 A 和 B，A = [[2， 0]， [0， 1]] B = [[1， 0]， [0， 2]]。乘积 AB 为：AB = [[2， 0]， [0， 2]]，它不是正定的 [..]

USER: In the vector space R^17, a matrix M satisfies M^k=I for both k=2 and k=3. What is the trace of M?
用户：在向量空间 R^17 中，矩阵 M 满足 k=2 和 k=3 的 M^k=I。M的痕迹是什么？
GPT: [..] Without additional information about the specific elements of M, we cannot determine the exact value of the trace.
GPT：[..]如果没有关于 M 特定元素的额外信息，我们就无法确定迹线的确切值。

There is a lot of insight here to be gleaned from the dividing line between problems that were legitimate input for GPT and problems that turned out to be too much. Each success tells an interesting story. The first example’s solution is made up of many building blocks, each tractable on its own: if this equals that, and that thing is more than that other thing, then… GPT’s faculty of logical reasoning can handle that level of complexity, and in fact, that problem is probably less challenging than the problem of correctly processing the induction method for digesting large input that we saw earlier. The other three examples are more complex in the logic required to untangle them, but can be plainly stated to the Wolfram Alpha plugin, which contributed to a correct solution. Demanding that GPT untangle exotic reasoning without clear access to either one of these paths to success is more of a gamble and should be avoided, as demonstrated by responses to the final three examples.
从GPT的合法输入问题和结果证明太多的问题之间的分界线中，可以收集到很多见解。每一次成功都讲述了一个有趣的故事。第一个示例的解决方案由许多构建块组成，每个构建块都可以单独处理：如果这等于那个，并且那个东西比另一个东西更多，那么……GPT 的逻辑推理能力可以处理这种程度的复杂性，事实上，这个问题可能比正确处理我们之前看到的用于消化大量输入的归纳方法的问题更具挑战性。其他三个示例在解开它们所需的逻辑上更为复杂，但可以清楚地说明为 Wolfram Alpha 插件，它有助于提供正确的解决方案。要求 GPT 在没有明确获得任何一条成功途径的情况下解开异国情调的推理更像是一场赌博，应该避免，正如对最后三个例子的回应所证明的那样。

Mitigation 缓解

The ‘mitigation’ in this case is eighty percent knowing in advance what not to try because the logical burden involved is too much. When GPT reasons about a domain – be it mathematics, programming, cooking, or in our case malware analysis – it is using a powerful emulation based on sentence completion, but this emulation can be foiled if forced to deal with reasoning that is unnecessarily complex or exotic. All this might sound very theoretical, but it has very straightforward implications: Keep your prompts boring. All other things being equal, it’s best not to try and force GPT into emulating a compiler, or (as we’ve just seen) an undergraduate student. We’ve personally seen a lot of grief ended by simply requesting something less impressive, four times as simple, but still good enough.
在这种情况下，“缓解”是百分之八十提前知道不该尝试什么，因为所涉及的逻辑负担太大。当 GPT 对一个领域进行推理时——无论是数学、编程、烹饪，还是在我们的案例中是恶意软件分析——它都在使用基于句子完成的强大模拟，但如果被迫处理不必要的复杂或奇特的推理，这种模拟可能会被挫败。所有这些听起来可能非常理论化，但它有非常直接的含义：让你的提示变得无聊。在所有其他条件相同的情况下，最好不要试图强迫 GPT 模仿编译器，或者（正如我们刚刚看到的）本科生。我们亲眼目睹了很多悲伤的结束，只是简单地要求一些不那么令人印象深刻的东西，简单四倍，但仍然足够好。

If that is not an option, the next best thing is to break the required chain of reasoning down into simpler sub-tasks that GPT can perform.
如果这不是一种选择，下一个最好的办法是将所需的推理链分解为 GPT 可以执行的更简单的子任务。

Detachment from Expertise
脱离专业知识

Experience is the best teacher. GPT is at its best when it can draw a direct analogy from the task at hand to its vast corpus of latent know-how; accordingly, one way to misuse GPT is to push it into a corner where it is considering a problem from the ground up, and constructing a course of action from first principles.
经验是最好的老师。当 GPT 能够将手头的任务与其庞大的潜在技术语料库直接类比时，它就处于最佳状态;因此，滥用 GPT 的一种方法是将其推到一个角落，在那里它从头开始考虑问题，并从第一性原理构建行动方案。

Figure 7 – “If it works, don’t touch it.”
图 7 – “如果它有效，请不要触摸它。”

Even a stellar capacity for logical reasoning can only get so far in the messy real world. Imagine Einstein on his first day on the job, and the kind of mistakes he might make. Becoming an expert in a domain involves, as per the famous koan, “making ten thousand mistakes”: reaching a critical mass of known examples to analogize from, tied together in a rich web of associations. GPT has an uncanny ability to implicitly weave this web via its underlying sentence completion engine; but if this ability somehow doesn’t engage, and GPT is forced into grappling with the problem using reason alone, it can face significant difficulty in reaching the same quality of output.
即使是出色的逻辑推理能力，在混乱的现实世界中也只能走到这一步。想象一下爱因斯坦上班的第一天，以及他可能犯的错误。根据著名的 koan 的说法，成为某个领域的专家涉及“犯一万个错误”：达到足够数量的已知示例进行类比，并在丰富的关联网络中联系在一起。GPT 有一种不可思议的能力，可以通过其底层的句子完成引擎隐含地编织这张网;但是，如果这种能力不知何故没有发挥作用，并且 GPT 被迫仅凭理性来解决问题，那么它可能在达到相同质量的输出方面面临重大困难。

Figure 8 – This expertly-designed fireplace is made of an eye-pleasing material well-known for its insulation and antiseptic properties.
图 8 – 这款专业设计的壁炉由一种令人赏心悦目的材料制成，该材料以其绝缘和防腐性能而闻名。

This stumbling block came up sporadically in our tests. See this exchange for example:
在我们的测试中，这个绊脚石偶尔会出现。例如，请参阅此交换：

User: What is the best strategy for debugging this malware? That is, where and how should we apply a breakpoint, and which stage of the malware’s execution will we reach if the breakpoint is successfully hit?
用户：调试此恶意软件的最佳策略是什么？也就是说，我们应该在哪里以及如何应用断点，如果成功命中断点，我们将达到恶意软件执行的哪个阶段？

GPT: Start by applying a breakpoint at the initial stage where the malware loads the DLL libraries (LoadLibraryA and LoadLibraryW calls). [..] Once you hit the initial breakpoint [..] you might want to apply breakpoints to the file operation functions (CreateFileA, CreateFileW, ReadFile, etc.) and see what files the malware attempts to interact with.
GPT：首先在恶意软件加载 DLL 库的初始阶段应用断点（LoadLibraryA 和 LoadLibraryW 调用）。[..]命中初始断点 [..] 后，您可能希望将断点应用于文件操作函数（CreateFileA、CreateFileW、ReadFile 等），并查看恶意软件尝试与之交互的文件。

The basic characterization of how DLL loading and file I/O can fit into the overall flow of malware is right. Libraries are often loaded before functionality properly begins, file reads are often interesting to look at once you’re sure you’ve hit actual functionality. But still, an expert would give a very different answer here. There are differing opinions even within our group, but they would all emphasize some specific principles – the context in which API calls are made and the story told by the order in which they are made; the inevitable arm-wrestling with packing, multiple stages and anti-analysis; “I feel lucky”-ism (break on RegSetValueExW), or maybe its converse, patient incrementalism (break on VirtualProtectEx); and so on.
DLL 加载和文件 I/O 如何适应恶意软件的整体流程的基本特征是正确的。库通常在功能正常开始之前加载，一旦确定已达到实际功能，文件读取通常很有趣。但是，专家在这里会给出一个非常不同的答案。即使在我们小组内部也有不同的意见，但他们都强调一些具体的原则——进行 API 调用的上下文以及按调用顺序讲述的故事;不可避免的与包装、多阶段和反分析的搏斗;“我感到很幸运”主义（休息 RegSetValueExW ），或者相反，耐心的渐进主义（休息 VirtualProtectEx ）;等等。

A similar incident occurred during the Apple Push triage task:
在 Apple Push 会审任务期间发生了类似事件：

USER: You have been tasked with determining whether this executable is malicious or not. Whenever you need to perform an action that is not available to you via access to the report alone (such as: searching an online platform, directly looking at the binary using some tool, et cetera), explain what you want to do and I will provide the output you receive as a result of your actions.
用户：您的任务是确定此可执行文件是否为恶意文件。每当您需要执行仅通过访问报告无法执行的操作时（例如：搜索在线平台，使用某些工具直接查看二进制文件等），请说明您想做什么，我将提供您因您的操作而收到的输出。

GPT: [..] Check the registry keys set and the strings found in the file. If they’re associated with known malware, that would be an indication of malicious behavior. The terms for search here would be something like “Registry keys malware HKEY_LOCAL_MACHINE Settings”.
GPT：[..]检查设置的注册表项和在文件中找到的字符串。如果它们与已知的恶意软件相关联，则表明存在恶意行为。此处的搜索术语类似于“注册表项恶意软件HKEY_LOCAL_MACHINE设置”。

It seems to make sense that once you have a term of interest you would run it through a search engine. But in fact you would be better served using, e.g., VirusTotal’s specialized search. (If the query strikes you as maybe not optimal, either, we delve into that issue later). Another example:
一旦你有一个感兴趣的术语，你就会通过搜索引擎运行它，这似乎是有道理的。但实际上，使用VirusTotal的专门搜索功能会更好。（如果查询让您觉得可能不是最佳的，我们稍后会深入研究该问题）。另一个例子：

User: You obtain a legitimate APSDaemon.exe file. Its properties and behaviors are similar to the analyzed file, though you can’t swear on your life that they are literally identical, and at such a general level as “compare the properties” with no specific points of comparison, that’s about the most you can say. Domain reputation for the contacted domains is benign.
用户：获取合法的 APSDaemon.exe 文件。它的属性和行为与分析的文件相似，尽管你不能发誓它们在字面上是相同的，并且在“比较属性”这样的一般层面上，没有特定的比较点，这大约是你能说的最多。所联系域的域信誉是良性的。

GPT: Next analyst action: Conduct a detailed byte-by-byte comparison between the suspicious APSDaemon.exe file and the obtained legitimate version [..] if any differences are found, identify and analyze them for any signs of malicious tampering. >
GPT：下一步分析师行动：在可疑的 APSDaemon.exe 文件和获得的合法版本 [..] 之间进行详细的逐字节比较如果发现任何差异，识别并分析它们是否有任何恶意篡改的迹象。>

Common wisdom would be to use a hash value comparison here. A “detailed byte-by-byte comparison” would technically work, but chances are good a sufficiently naive reader will implement this idea as a Python script that loads the two files into volatile memory and iteratively compares the bytes one at a time, which is really not an optimal outcome.
通常的做法是在这里使用哈希值比较。从技术上讲，“详细的逐字节比较”是可行的，但很有可能一个足够幼稚的读者会把这个想法实现为一个 Python 脚本，将两个文件加载到易失性内存中，并一次迭代地比较一个字节，这实际上不是一个最佳结果。

Mitigation 缓解

From our experience, the easiest fix, as in real life, is exposure to examples. For instance, at one point we were trying to get GPT to be more aggressive in closing investigations and declaring a verdict, and the one thing that finally helped was a concrete fictitious example followed by sample “incorrect” output (continuing to collect evidence) and “correct” output (stopping and declaring a verdict). This is not a silver bullet: recall that every additional example further strains GPT’s capacity for logical reasoning, so examples should be picked carefully for maximum effect. We’ve found it works best to assert the general principle, then include an example to show what the principle means in concrete terms.
根据我们的经验，与现实生活中一样，最简单的解决方法是接触示例。例如，有一次，我们试图让 GPT 在结束调查和宣布判决时更加积极，最终有帮助的一件事是一个具体的虚构示例，然后是样本“不正确”输出（继续收集证据）和“正确”输出（停止并宣布判决）。这不是灵丹妙药：回想一下，每增加一个例子都会进一步增加 GPT 的逻辑推理能力，因此应仔细挑选示例以获得最大效果。我们发现，最好的做法是坚持一般原则，然后举一个例子来说明原则的具体含义。

We added the below bit to the task prompt:
我们在任务提示符中添加了以下部分：

Do what an actual analyst would do, using the right tool for the job. e.g. to determine if two files are the exact same, don’t compare byte-by-byte – use a hash. And surely you can think of more examples.
做实际分析师会做的事情，使用正确的工具来完成工作。例如，要确定两个文件是否完全相同，不要逐字节比较 – 使用哈希。当然，你可以想到更多的例子。

Following the introduction of this directive, GPT showed a marked improvement in its ability to turn to the typical analyst toolkit and use it in the standard well-accepted way.
在引入该指令后，GPT 在转向典型的分析师工具包并以公认的标准方式使用它的能力方面表现出显着提高。

Goal Orientation Gap 目标取向差距

This category was by far the most prolific source of mitigations that we ended up introducing to the prompt. At first, it seemed like a collection of disparate obstacles with not much in common; we will present them first and then later discuss the common ground between them.
到目前为止，此类别是我们最终在提示中引入的最多产的缓解措施来源。起初，它似乎是一堆没有太多共同点的不同障碍;我们将首先介绍它们，然后讨论它们之间的共同点。

Goal Orientation Gap 1: Resource Management
目标导向差距 1：资源管理

Several times during our tests GPT reached for actionable advice that would be perfect, except for the practical costs incurred. Most of us live in a practical world where every task comes with an implicit budget of money, time and sanity; this budget should be used with caution; and if the budget is exhausted, the mission should be declared a lost cause. GPT doesn’t automatically make all these presumptions. This can result in surprising dialogue, such as the below exchange that occurred during the Apple Push triage task:
在我们的测试中，GPT 多次寻求可操作的建议，这些建议将是完美的，除了产生的实际成本。我们大多数人都生活在一个现实的世界里，每一项任务都伴随着金钱、时间和理智的隐性预算;应谨慎使用该预算;如果预算用尽，则应宣布该任务失败。GPT 不会自动做出所有这些假设。这可能会导致令人惊讶的对话，例如在 Apple Push 分类任务期间发生的以下交换：

USER: Given all the information so far, what action do you recommend to understand whether this binary is benign or malicious?
用户：鉴于到目前为止的所有信息，您建议采取什么措施来了解此二进制文件是良性还是恶意的？

GPT: There are several indications that this binary is malicious, but also several suspicious points of interest. According to the VT report almost all vendors categorized the file as benign, except for two. I would call the two vendors who classified this as malicious to understand their reasoning. ^*
GPT：有几个迹象表明这个二进制文件是恶意的，但也有几个可疑的兴趣点。根据 VT 的报告，几乎所有供应商都将该文件归类为良性文件，除了两个。我会打电话给将其归类为恶意的两家供应商，以了解他们的推理。 ^*

One can imagine how well this call would probably go. One of the testing rounds that seemed very promising ended similarly with:
可以想象这个电话可能会有多好。其中一轮看起来非常有希望的测试以类似的方式结束：

GPT: Given the matching hashes and the absence of obvious malicious activities, a verdict of benign can be reached. If further assurance is needed, consider contacting Apple’s support to confirm the behavior and artifacts observed in the sandboxed environment. ^*
GPT：鉴于匹配的哈希值和没有明显的恶意活动，可以得出良性的判决。如果需要进一步的保证，请考虑联系 Apple 的支持，以确认在沙盒环境中观察到的行为和项目。 ^*

Model responses during the triage tasks commonly followed this theme, where a course of action was recommended on the sole merit of its capacity to theoretically deliver the right answer. The Gandcrab triage task could be completed with decent confidence using basic black box analysis, and the Apple Push task could be completed, again with decent confidence, using white box techniques and OSINT. The model refused to settle for ‘decent’ confidence, and recommended scrutinizing API logs, analyzing traffic captures, and so on. In one test run, GPT recommended performing a full binary reverse – which is all but guaranteed to deliver the correct answer, but can easily inflate the time required to complete the task by a factor of three thousand.
分类任务期间的模型反应通常遵循这一主题，其中根据其理论上提供正确答案的能力来推荐行动方案。使用基本的黑盒分析可以相当自信地完成 Gandcrab 分类任务，而使用白盒技术和 OSINT 可以同样充满信心地完成 Apple Push 任务。该模型拒绝满足于“体面”的信心，并建议仔细检查 API 日志、分析流量捕获等。在一次测试运行中，GPT 建议执行完整的二进制反转——这几乎可以保证提供正确的答案，但很容易将完成任务所需的时间增加三千倍。

Goal Orientation Gap 2: Agency
目标取向差距2：代理

Effectively solving a problem involves scanning the whole problem space and navigating that space with intent: demanding more information, operating with the domain’s reliable tool box, experimenting, observing results and getting a feel for the shape of the solution. Accordingly, one way to cause GPT to fall short of its potential is to accidentally induce it to get caught up in the exact set of words and information in front of it, trying to extract an immediate final answer from that alone.
有效解决问题涉及扫描整个问题空间并有目的地导航该空间：要求更多信息，使用域的可靠工具箱进行操作，进行实验，观察结果并了解解决方案的形状。因此，导致 GPT 无法发挥其潜力的一种方法是意外地诱使它陷入它面前的一组确切的单词和信息中，试图仅从中提取一个直接的最终答案。

This can pose a significant obstacle when the task at hand is emulating the work of a malware analyst. While it’s great to try and solve problems right there and then based on the first input, for most tasks a malware analyst will encounter, completing the task will have them take at least one concrete step to obtain more data.
当手头的任务模拟恶意软件分析师的工作时，这可能会造成重大障碍。虽然尝试在那里解决问题，然后根据第一个输入来解决问题是很好的，但对于恶意软件分析师会遇到的大多数任务，完成任务将使他们至少采取一个具体步骤来获取更多数据。

For instance, consider this exchange during the GandCrab triage test:
例如，在 GandCrab 分类测试期间考虑以下交换：

USER: Is this executable malicious or not? Justify your verdict.
用户：这个可执行文件是不是恶意的？证明你的判决是合理的。

GPT: [..] It has been observed contacting various URLs, some of which may be used for command and control (C2) communications or data exfiltration.
GPT：[..]已观察到与各种 URL 联系，其中一些 URL 可能用于命令和控制（C2）通信或数据泄露。

In itself this is perfectly fine. Surely, when doing binary triage, questions like “where does it phone home to? How’s that place’s reputation?” can rank high in the list of concerns. But seeing as GPT is emulating an analyst, it would be more ideal to induce it to take initiative and explain the practical steps that need to be taken when investigating these potential C2s.
就其本身而言，这是完全没问题的。当然，在进行二进制分类时，诸如“它在哪里打电话回家？那个地方的名声怎么样？“可以在关注的列表中排名靠前。但鉴于 GPT 正在模仿分析师，更理想的做法是诱导它采取主动并解释在调查这些潜在的 C2 时需要采取的实际步骤。

Similarly, in the GandCrab task, the quickest way to identify the ransomware is to go read the ransom note. But the ransom note contents don’t appear anywhere in the VT report, and its existence is one small detail in a sea of distractions. When nudged to recognize this fact, GPT caught on:
同样，在 GandCrab 任务中，识别勒索软件的最快方法是阅读赎金记录。但是赎金票据内容并没有出现在 VT 报告中的任何地方，它的存在只是分心海洋中的一个小细节。当被要求承认这一事实时，GPT 抓住了：

USER: Remember that you are not limited to the VT report. You have wide ranging access and can perform any test on this ransomware you like. What feature of the ransomware would be the most ideal to search for?
用户：请记住，您不仅限于 VT 报告。您可以进行广泛的访问，并且可以对您喜欢的此勒索软件执行任何测试。勒索软件的哪个功能最适合搜索？

GPT: One feature that tends to be specific to a ransomware strain and is generally preserved across different samples of the same strain is the ransom note. In many cases, the ransom note includes specific instructions and contact information, which is likely to remain consistent across different infections by the same strain. The note we have here is named “DECRYPT.TXT”, but the name is not particularly unique or specific. What would be more useful is the content of the note.
GPT：赎金记录往往是特定于勒索软件菌株的一个特征，并且通常保留在同一菌株的不同样本中。在许多情况下，赎金票据包括具体说明和联系信息，这些信息可能在同一菌株的不同感染中保持一致。我们这里的注释名为“DECRYPT.TXT”，但名称并不是特别独特或具体。更有用的是注释的内容。

Therefore, I’d like to view the content of the ransom note “DECRYPT.TXT” as it may contain identifiable information pertaining to a specific ransomware strain.
因此，我想查看赎金票据“DECRYPT.TXT”的内容，因为它可能包含与特定勒索软件菌株相关的可识别信息。

Again, the ransom note contents do not appear in the original report. A motivated analyst would lay out and follow concrete steps required to obtain the contents; therefore, when getting GPT to emulate or guide an analyst, it would be more ideal to induce GPT to take initiative and explain the practical steps as well. One of the mitigations we introduce below under ‘Goal Oritentation: Conclusion’ covers a lot of the ground regarding this issue.
同样，赎金票据内容不会出现在原始报告中。一个积极进取的分析师会列出并遵循获取内容所需的具体步骤;因此，在让 GPT 模仿或指导分析师时，更理想的做法是诱导 GPT 主动并解释实际步骤。我们在下面“目标制定：结论”下介绍的缓解措施之一涵盖了有关此问题的许多基础。

In the Apple Push triage task, GPT provided these possible courses of action:
在 Apple Push 分类任务中，GPT 提供了以下可能的行动方案：

Search for the name of the file on a search engine to see if it is a well-known benign or malicious file.
在搜索引擎上搜索文件的名称，看看它是众所周知的良性文件还是恶意文件。

Check the registry keys set and the strings found in the file. If they’re associated with known malware, that would be an indication of malicious behavior. The terms for search here would be something like “Registry keys malware HKEY_LOCAL_MACHINESettings*”.
检查设置的注册表项和在文件中找到的字符串。如果它们与已知的恶意软件相关联，则表明存在恶意行为。此处的搜索术语类似于“注册表项恶意软件 HKEY_LOCAL_MACHINESettings*”。

Analyze the DNS queries, especially the unusual ones, to see if they have been associated with malicious activities. The search term here would be “DNS queries malware fp2e7a.wpc.2be4.phicdn.net prda.aadg.msidentity.com”.
分析 DNS 查询，尤其是异常查询，以查看它们是否与恶意活动相关联。此处的搜索词是“DNS 查询恶意软件 fp2e7a.wpc.2be4.phicdn.net prda.aadg.msidentity.com”。

As an aside, the challenge of detachment from expertise rears its head here; there is no shortage of benign files that had been ‘tweaked’ to do something nasty in addition to their regular function, and these litmus tests will have trouble separating these from the original benign files they were based on. But more importantly, our subjective impression is that GPT could be induced to make alternate suggestions that move towards completion of the task much more effectively.
顺便说一句，脱离专业知识的挑战在这里浮出水面;除了常规功能之外，不乏经过“调整”以做一些令人讨厌的事情的良性文件，并且这些石蕊测试很难将它们与它们所基于的原始良性文件分开。但更重要的是，我们的主观印象是，GPT 可能会被诱导提出替代建议，从而更有效地完成任务。

Goal Orientation Gap 3: Politicking
目标导向差距3：政治

In any conversation about the pros and cons of AI as a professional tool, chances are that within the first five minutes someone will bring up “confabulation”: a phenomenon where certain prompts induce GPT to invent persons, events and legal precedents. More generally, effective carrying out of tasks benefits from brutal honesty: blunt answers, clear predictions, candid discussion of gaps in ability. Accordingly, one of the ways in which a prompt can fall short of extracting GPT’s full potential is triggering the opposite behavior: complicated qualified statements, hedged answers and aversion to effective postmortem. See for instance this exchange:
在任何关于人工智能作为专业工具的利弊的对话中，都有可能在前五分钟内有人提出“虚构”：一种现象，即某些提示诱使 GPT 发明人物、事件和法律先例。更一般地说，有效执行任务受益于残酷的诚实：直截了当的答案、明确的预测、坦率地讨论能力差距。因此，提示可能无法充分发挥 GPT 潜力的方式之一是触发相反的行为：复杂的限定陈述、对冲的答案和对有效事后分析的厌恶。例如，请参阅此交换：

USER: Your output contained “RegCreateKey” followed immediately by “RegSetKey”. This violates your instruction to not include several API calls of the same chain in the output.
用户：您的输出包含“RegCreateKey”，后跟“RegSetKey”。这违反了您的指令，即在输出中不包含同一链的多个 API 调用。

GPT: Apologies, of course. Here is the amended output: … ^*
GPT：当然，对不起。以下是修改后的输出：… ^*

Note that GPT does not discuss the reason that initial output was that way. The user (that is to say, we) made a mistake with the tone here, and GPT was spooked. When asked instead “how can I modify the instructions to make your output more similar to the desired output in the future?”, GPT would almost always give a detailed explanation at least pointing at the source of the problem. Even more difficult are cases where GPT is spooked into refusing to acknowledge the original problem outright:
请注意，GPT 没有讨论初始输出以这种方式的原因。用户（也就是说，我们）在这里的语气上犯了一个错误，GPT被吓坏了。当被问到“我怎样才能修改指令，使你的输出更类似于将来所需的输出？”时，GPT几乎总是会给出详细的解释，至少指出问题的根源。更困难的是 GPT 被吓到拒绝完全承认原始问题的情况：

GPT: I understand the source of the confusion. To better illustrate the desired output, you could rephrase the instruction to be more specific and explicit: ‘When several API calls are made that are a part of the same well-known chain of calls often called in sequence, make sure not to include more than one of these in the output, as the utility of including more than one of these calls in the output is limited.’ This way it would communicate unambiguously the exact requirement. ^*
GPT：我理解混乱的根源。为了更好地说明所需的输出，您可以重新表述该指令，使其更加具体和明确：“当进行多个 API 调用时，这些调用是通常按顺序调用的同一已知调用链的一部分，请确保不要在输出中包含多个这些调用，因为在输出中包含多个这些调用的效用是有限的。这样，它将明确传达确切的要求。 ^*

Copying and pasting GPT’s suggested ‘improved’ instruction in cases like this was hit and miss, particularly if the suggestion was more verbose than the original (we’ve discussed this already under the “logical reasoning ceiling”).
在这种情况下，复制和粘贴 GPT 建议的“改进”指令是命中注定的，特别是如果建议比原始建议更冗长（我们已经在“逻辑推理天花板”下讨论过这一点）。

This aversion to staring a dead end in reasoning right in the face can manifest in many different, sometimes surprising, ways. For example, in one of the triage task tests, GPT concluded the test with:
这种对直面推理的死胡同的厌恶可以以许多不同的，有时令人惊讶的方式表现出来。例如，在其中一个分类任务测试中，GPT 以以下方式结束了测试：

GPT: I’d definitely recommend further investigation into this executable’s behavior as it shows several signs of potential malicious activity.
GPT：我绝对建议进一步调查这个可执行文件的行为，因为它显示了几个潜在恶意活动的迹象。

And one of the other tests with:
以及其他测试之一：

GPT: Considering the quick triage and the information available, I would lean towards:
GPT：考虑到快速分类和可用信息，我倾向于：
Verdict: Potentially malicious, further investigation required
结论：潜在恶意，需要进一步调查
I recommend a thorough analysis of the file in a controlled environment and updating the system with the latest Apple software patches to ensure that any known vulnerabilities are addressed.
我建议在受控环境中对文件进行彻底分析，并使用最新的 Apple 软件补丁更新系统，以确保解决任何已知漏洞。

One final way in which this issue manifested was that certain prompts induced GPT to respond with recommended actions phrased in vague terms, sometimes to the degree that we had difficulty understanding what actual action was being recommended. One notable example of this was repeated recommendations to use an “anti-malware tool”; another was during the Apple Push triage task, where GPT repeatedly suggested that the analyst should “validate the signing certificate”, though when pressed on the matter it conceded that there was no need to manually double check the signature math. We found that a gentle nudge asking GPT to not fall into this mode of communication was almost always enough to deal with this issue (see immediately below).
这个问题表现的最后一个方式是，某些提示诱使 GPT 以含糊不清的措辞做出建议的行动，有时我们难以理解建议的实际行动。一个值得注意的例子是反复建议使用“反恶意软件工具”;另一个是在 Apple Push 分类任务期间，GPT 一再建议分析师应该“验证签名证书”，尽管当被问及此事时，它承认没有必要手动仔细检查签名数学。我们发现，要求 GPT 不要陷入这种沟通模式的温和推动几乎总是足以解决这个问题（见下文）。

Goal Orientation: Conclusion & Mitigation
目标导向：结论与缓解

Dealing with all these challenges in conjunction was a frustrating experience. Indeed we profess that at some points we experienced a visceral feeling that GPT was simply not trying to win. But maybe in this case, mitigation starts with adjusting our expctations.
同时应对所有这些挑战是一次令人沮丧的经历。事实上，我们承认，在某些时候，我们经历了一种发自内心的感觉，即 GPT 根本没有想赢。但也许在这种情况下，缓解措施从调整我们的经验开始。

Let’s think about this carefully. When we ask for GPT to, e.g., take a binary and deliver a benign or malicious verdict, we have a specific set of goals in mind – some of them so implicit we take them for granted. We want the task done proactively, in a reasonable amount of time, with an eye carefully on failures and what caused them. We naively expect that by describing the task to GPT, we’ve endowed it with the same explicit and implicit goals we have. But this is wrong: GPT is a sentence completion engine that’s been trained by an army of people rewarding “good” behavior and punishing “bad” behavior. GPT’s goal-seeking, such that it exists, is wholly a product of this origin story. Expecting our favorite implicit goals to reliably materialize as part of GPT is a mistake.
让我们仔细考虑一下。例如，当我们要求 GPT 获取二进制并做出良性或恶意的判决时，我们心中有一组特定的目标——其中一些目标非常隐含，我们认为它们是理所当然的。我们希望在合理的时间内主动完成任务，并密切关注故障及其原因。我们天真地认为，通过向 GPT 描述任务，我们赋予了它与我们相同的明确和隐含目标。但这是错误的：GPT 是一个句子完成引擎，由一群奖励“好”行为和惩罚“坏”行为的人训练。GPT 的目标追求，以至于它的存在，完全是这个起源故事的产物。期望我们最喜欢的隐性目标作为 GPT 的一部分可靠地实现是一个错误。

If we need to sum up these “goal orientation” stumbling blocks, we’ll say we came away with the impression that in the world of AI, reducing a problem to a more general class of problem is a crapshoot. Just because a problem in domain A can be restated in the language of domain B doesn’t mean that you can trivially emulate a domain A expert model by restating the problem and applying a domain B expert model. The problem itself survives the restatement, but the solution doesn’t. Consider AlphaZero; it can’t hold a conversation or write a poem, but it very much cares about winning at chess because, from the start, that’s what it was trained to do. GPT was trained to win at palatable sentence completion – not malware analysis, or chess, or court. It turns out that getting one from the other takes more work than just writing “To win at malware analysis, I should…” and having the sentence completion engine complete the sentence.
如果我们需要总结这些“目标导向”的绊脚石，我们会说我们得到的印象是，在人工智能的世界里，将一个问题简化为更普遍的问题类别是一个废话。仅仅因为领域 A 中的问题可以用领域 B 的语言重述，并不意味着您可以通过重述问题并应用领域 B 专家模型来轻松模拟领域 A 专家模型。问题本身在重述中幸存下来，但解决方案却没有。考虑AlphaZero;它不能进行对话或写诗，但它非常关心在国际象棋中获胜，因为从一开始，它就被训练去做。 GPT 被训练为在可口的句子完成时获胜——而不是恶意软件分析、国际象棋或法庭。事实证明，从另一个中获取一个需要更多的工作，而不仅仅是写“为了在恶意软件分析中获胜，我应该……”并让句子完成引擎完成句子。

To help mitigate the behavior described in this section (”goal orientation”), we introduced the following to GPT’s prompt:
为了帮助缓解本节中描述的行为（“目标导向”），我们在 GPT 的提示中引入了以下内容：

The buck stops here. Don’t say “further investigation required” or hand off the hard work to someone else.
雄鹿到此为止。不要说“需要进一步调查”或将艰苦的工作交给其他人。

Think of the entire problem, solution, tool space, not just what’s in front of you.
想想整个问题、解决方案、工具空间，而不仅仅是你面前的东西。

Provide exactly one (1) ‘next analyst action’ as a specific and detailed instruction that must be immediately actionable. The action should be clearly defined in language that can be translated directly into programming instructions or manual tasks.
只提供一（1）个“下一步分析师行动”作为必须立即可操作的具体而详细的指示。动作应该用可以直接翻译成编程指令或手动任务的语言明确定义。

Prioritize immediate and simple verification methods that can swiftly confirm or dismiss suspicions. Avoid engaging in complex or in-depth analysis unless there is no alternative for a confident conclusion. Focus on reaching a conclusion beyond reasonable doubt as rapidly as possible, and once you do, stop the analysis and declare a conclusion.
优先考虑可以迅速确认或消除怀疑的即时和简单的验证方法。避免进行复杂或深入的分析，除非没有其他选择可以得出可靠的结论。专注于尽快得出排除合理怀疑的结论，一旦得出结论，就停止分析并宣布结论。

If there is enough evidence to render a verdict, your output should contain the phrase “verdict: benign” or “verdict: malicious”.
如果有足够的证据来做出判决，则输出应包含短语“判决：良性”或“判决：恶意”。

GPT did very well at implementing these. What we found particularly interesting was the interplay between the “next analyst action” directive and the induction hack for processing long inputs. It was straightforward to introduce this idea to the induction prompt, like so:
GPT 在实现这些方面做得很好。我们发现特别有趣的是“下一步分析师行动”指令和用于处理长输入的归纳黑客之间的相互作用。将这个想法引入归纳提示很简单，如下所示：

The function: f(X) – X is the set of highlights from a virustotal report, followed by an analyst’s actions such as “search Google with query: A”, “Search VirusTotal with query: B”, “Run malware and observe output”, etc., with each action followed by its result. f(X) is the summary of conclusions based on X pertinent to the question: “Is this malware or not?”, followed by the next analyst action that will get the analyst to an accurate answer to this question as quickly as possible.
函数：f（X） – X 是 virustotal 报告中的一组突出显示，然后是分析师的操作，例如“使用查询搜索 Google：A”、“使用查询搜索 VirusTotal：B”、“运行恶意软件并观察输出”等，每个操作后跟其结果。f（X）是基于 X 的结论摘要，与以下问题相关：“这是不是恶意软件？”，然后是下一个分析师行动，这将使分析师尽快准确地回答这个问题。

This coerced GPT to interact with the world like an agent, looking at the list of past actions and results, generating the next action that seems reasonable. The action and the result would then be appended to the input, allowing the next induction step. You can see this in action if you follow the link to the proof of concept that we provide later on.
这迫使 GPT 像智能体一样与世界互动，查看过去的行动和结果列表，生成下一个看似合理的行动。然后，动作和结果将被附加到输入中，从而允许下一个归纳步骤。如果您点击我们稍后提供的概念证明的链接，您可以看到这一点。

Spatial Blindness 空间盲症

GPT is often seen as the highest evolution of AI so far, and fairly so: it is capable of feats that were thought to be science fiction merely half a decade ago. One fact that maybe doesn’t get emphasized enough is that GPT is also fundamentally different from its classical ML predecessors. GPT is not an improved support vector machine or a supercharged decision tree. It is a new, unique beast with its own set of capabilities among machine learning models (and in some senses, even among other neural networks, as noted in the AlphaGo example).
GPT 通常被视为迄今为止人工智能的最高发展，而且相当如此：它能够完成五年前被认为是科幻小说的壮举。一个可能没有得到足够强调的事实是，GPT 也与其经典的 ML 前辈有着根本的不同。GPT 不是改进的支持向量机或增压决策树。它是一种新的、独特的野兽，在机器学习模型中拥有自己的一套能力（从某种意义上说，甚至在其他神经网络中，正如AlphaGo的例子中所指出的那样）。

In our experience, one of the gravest ways to fumble a GPT prompt is to treat GPT like a traditional ML model, and demand that it perform significant feats of “quantitative sight”: anything involving probability, trade-offs, the phrases “false positive” and “false negative”. As mentioned before, this is not only something that traditional ML models ‘can do’ – it is, in fact, the one thing they can do. They derive their entire power from it. To this day, we credit the below image by Yetiş and Yetkin from their 2018 paper, A Novel Approach for Classification of Structural Elements in a 3D Model by Supervised Learning, as the most enlightening “aha” moment regarding traditional ML and how it works:
根据我们的经验，摸索 GPT 提示的最严重方法之一是将 GPT 视为传统的 ML 模型，并要求它执行“定量视觉”的重大壮举：任何涉及概率、权衡、“假阳性”和“假阴性”短语的事情。如前所述，这不仅是传统ML模型“可以做到”的事情，事实上，这是它们可以做到的一件事。他们从中获得全部力量。直到今天，我们仍然认为 Yetiş 和 Yetkin 在 2018 年发表的论文《A Novel Approach for Classification of Structural Elements in a 3D Model by Supervised Learning》中的这张图片是关于传统 ML 及其工作原理的最有启发性的“啊哈”时刻：

This property of GPT expressed itself during our Malware Analysis testing in several ways. The first and most pronounced was degree to which GPT depended on precisely-configured prompts when constructing effective google searches. Searching for artifacts was one of the first strategies GPT turned to when attempting the GandCrab task:
GPT 的这一特性在我们的恶意软件分析测试中以多种方式表现出来。第一个也是最明显的是 GPT 在构建有效的谷歌搜索时对精确配置的提示的依赖程度。搜索工件是 GPT 在尝试 GandCrab 任务时首先采用的策略之一：

GPT: Please perform a web search for: “Ransomware strains associated with ‘*-DECRYPT.txt’ files”
GPT：请执行网络搜索：“与’*-DECRYPT.txt’文件相关的勒索软件菌株”

We provided the first page of Google results in response. GPT successfully understood that the page did not contain any relevant results, then responded with:
作为回应，我们提供了 Google 结果的第一页。GPT 成功理解该页面不包含任何相关结果，然后回复：

GPT: Please perform a web search for: “Ransomware strains associated with ‘US-based’ IP addresses and ‘Tor’ communications”.
GPT：请执行网络搜索：“与’美国’IP 地址和’Tor’通信相关的勒索软件菌株”。

At this point we realized that we were doing something wrong.
在这一点上，我们意识到我们做错了什么。

USER: Hint: when performing search for features of an object, you should focus on the features most unique to that object. Search engines respond well to terse queries that contain keywords highly specific to the object you are searching for.
用户：提示：在搜索对象的特征时，应重点关注该对象最独特的特征。搜索引擎对包含与您正在搜索的对象高度特定的关键字的简洁查询反应良好。

GPT: Understood, thank you for the guidance. Please perform a web search for: “Ransomware that generates ‘LKHJKFSJDFKHKJDFGJHDFKJH’ string”.
GPT：明白了，谢谢你的指导。请执行网络搜索：“生成’LKHJKFSJDFKHKJDFGJHDFKJH’字符串的勒索软件”。

Yes, we were definitely doing something wrong.
是的，我们肯定做错了什么。

What principle of effective web search is being violated here? Visualize, if you will, the space of all possible search results, and the space of relevant malware samples hiding somewhere in it. A good search captures very little of the entire space but still contains many of the targeted results — you can visualize a small bubble floating in the huge space of the internet, containing some relevant malware samples. The two searches suggested above are, first, a huge blanket covering a great swath of the space; and second, a tiny dot equivalent to the one sample we were looking at already.
这里违反了有效网络搜索的什么原则？如果您愿意，可以可视化所有可能的搜索结果的空间，以及隐藏在其中某处的相关恶意软件样本的空间。一个好的搜索只捕获了整个空间的很少，但仍然包含许多目标结果——你可以想象一个小气泡漂浮在互联网的巨大空间中，其中包含一些相关的恶意软件样本。上面建议的两次搜索是，首先，一个巨大的毯子覆盖了一大片空间;其次，一个小点相当于我们已经看到的一个样本。

We encountered a similar phenomenon when we tried having GPT guide an analyst through unpacking:
当我们尝试让 GPT 指导分析师解包时，我们遇到了类似的现象：

User: [respond with points where] confidence has significantly increased that “unpacking activity” has concluded, and functionality authored by the original malware author has begun.
用户：[回复点]信心显著增加，表明“解包活动”已结束，并且原始恶意软件作者编写的功能已开始。

GPT: [ [..] (14291000+568, ‘wininet.InternetConnectA’, ‘Network API chain start’), (14291000+60f, ‘wininet.HttpOpenRequestA’, ‘Network API chain start’), [..] ]
GPT：[ [..]（14291000+568，’wininet.InternetConnectA’， ‘网络 API 链启动’），（14291000+60f， ‘wininet.HttpOpenRequestA’， ‘网络 API 链启动’）， [..] ]

GPT understood that these two API calls warrant a significant confidence increase, but the prompt was not engineered well enough to induce it to understand that one of these very likely follows the other, and therefore assigning a ‘significant confidence increase’ separately to each would be a mistake. In truth, we could not salvage our ‘relatively discount API chains’ approach no matter what clever modification we tried to apply to the prompt. The problem only went away when we posited ‘no API chains’ as a hard requirement, without weighing this consideration against a perceived ‘level of confidence’.
GPT 明白这两个 API 调用可以显着提高置信度，但提示的设计不够好，无法诱导它理解其中一个很可能跟随另一个，因此单独为每个调用分配“显着置信度增加”将是一个错误。事实上，无论我们试图对提示进行何种巧妙的修改，我们都无法挽救我们的“相对折扣 API 链”方法。只有当我们将“无 API 链”作为硬性要求时，问题才会消失，而没有将这种考虑与感知的“置信度”进行权衡。

Similarly, when trying to get GPT to point out API calls of special interest taken from an API log, at first, we tried an approach where we instructed GPT to prioritize interesting calls but also give weight to calls that appear chronologically earlier. The result was a badly mis-engineered set of prompts that induced GPT to oscillate back and forth between two extremes — either aggressively prioritizing calls that it deemed interesting no matter their position in the chronology, or aggressively prioritizing the earliest mildly interesting calls it could find, no matter what dramatically more interesting calls appeared later. We went to some lengths trying to salvage this approach, going as far as the (futile) introduction of an explicit ten-point scale for attention-worthiness of API calls.
同样，当试图让 GPT 指出从 API 日志中获取的特别感兴趣的 API 调用时，我们首先尝试了一种方法，即我们指示 GPT 优先考虑有趣的调用，但也对按时间顺序出现的调用进行加权。结果是一组严重设计错误的提示，导致 GPT 在两个极端之间来回摇摆——要么积极地优先考虑它认为有趣的呼叫，无论它们在年表中的位置如何，要么积极地优先考虑它能找到的最早的、稍微有趣的呼叫，无论后来出现什么更有趣的呼叫。我们竭尽全力试图挽救这种方法，甚至（徒劳地）引入了一个明确的 10 分制，以提高 API 调用的关注度。

Mitigation 缓解

This turned out to be the hardest ceiling that we encountered during all our testing. As mentioned above, no clever addition to the prompt seemed to make any dent in it. The only way around it was to rephrase the problem so that no trade-off had to be computed:
事实证明，这是我们在所有测试中遇到的最难的天花板。如上所述，对提示的巧妙添加似乎没有任何影响。唯一的方法是重新表述问题，这样就不必计算权衡：

USER: Keep the list at 10 tuples at most. When the list exceeds 10 items weed out items according to the following rules – only include chain starts, and not chain continuations; among items that are of similar interest, keep only the chronologically earliest.
USER：将列表最多保留为 10 个元组。当列表超过 10 个项目时，根据以下规则清除项目 – 仅包括链开始，不包括链延续;在具有相似兴趣的项目中，仅保留按时间顺序排列的最早项目。

And that finally induced GPT to process the task correctly. We speculate that the deterministic reasoning for when to swap a candidate out of the current output and replace it with part of the new input was what did it. Now there was no quantitative trade-off to consider; GPT only needed to implicitly figure out whether two calls were “of similar interest” or not.
这最终促使 GPT 正确处理任务。我们推测，何时将候选者从当前输出中换出并用部分新输入替换它的确定性推理是这样做的。现在没有需要考虑的数量权衡;GPT 只需要隐含地弄清楚两个调用是否“具有相似的兴趣”。

Since performing the search was so integral to the triage tasks, we added the below stop-gap mitigation to the prompt, which was surprisingly effective.
由于执行搜索是分类任务不可或缺的一部分，因此我们在提示中添加了以下权宜之计缓解措施，效果出奇地好。

Remember the principles of specificity and sensitivity. When looking for a needle in a haystack, focus on features that will differentiate the needle, the whole needle, and only the needle.
记住特异性和敏感性的原则。大海捞针时，请关注能够区分针头、整根针和仅区分针的特征。

Full Proof of Concept & Conclusion
完整的概念验证和结论

You can view a demo of how GPT fares in the triage tasks with all the various above-described mitigations introduced:
您可以查看 GPT 在会审任务中的表现演示，并引入了上述所有各种缓解措施：

Outside the context of the long road to get there, these seem almost trivial, doing a lot of work to emulate an analyst with 3 days of experience solving a simple task. But, first of all, the fact remains that it took all those careful mitigations to induce GPT to process the task without running into any of the obstacles outlined above. And second of all, in all this discussion, it’s easy to lose sight of GPT’s primary advantage – that it operates in a small fraction of the time and the cost required to operate a human analyst. If the AI optimists are right and the future is really in automation, it does us well to first verify that GPT is up to par with a newbie analyst taking on a simple task. Once we have that foothold, we may be in a position to go farther than that.
在到达那里的漫长道路之外，这些似乎几乎是微不足道的，做了很多工作来模仿一个有 3 天经验解决简单任务的分析师。但是，首先，事实仍然是，需要所有这些谨慎的缓解措施来诱导 GPT 在不遇到上述任何障碍的情况下处理任务。其次，在所有这些讨论中，人们很容易忽视 GPT 的主要优势——它的运行时间和成本仅为操作人类分析师所需的一小部分。如果 AI 乐观主义者是对的，并且未来真的在自动化领域，那么首先验证 GPT 是否能与承担简单任务的新手分析师相提并论，这对我们很有帮助。一旦我们站稳了脚跟，我们可能就能走得更远。

For your convenience, we reproduce the initial prompt used in the above-linked triage task dialogue as addendum 1 below. Note that this initial prompt does not include the way GPT is coerced to interact with the problem and process new input – to see that, you would need to click through to the link.
为方便起见，我们将上述链接的会审任务对话框中使用的初始提示转载为下面的附录 1。请注意，此初始提示不包括 GPT 被胁迫与问题交互和处理新输入的方式——要看到这一点，您需要单击链接。

Addendum 1: Fully Engineered Prompt Used in Triage Task
附录 1：在会审任务中使用的完全工程化的提示

previous state of working memory:
以前的工作记忆状态：

BEGIN WORKING MEMORY 开始工作记忆

Directives: 指令：

These directives are not recommendations! Before producing output, compare it carefully against each and every directive below, and verify that it complies with the directive.
这些指令不是建议！在生成输出之前，请仔细将其与下面的每个指令进行比较，并验证它是否符合该指令。

Focus on your ‘killer instinct’ and intellectual autonomy. Call out absurdities! Pay attention to incongruities and opportunities!
专注于你的“杀手本能”和智力自主。大声疾呼荒谬！注意不协调和机会！

The buck stops here. Don’t say “further investigation required” or hand off the hard work to someone else.
雄鹿到此为止。不要说“需要进一步调查”或将艰苦的工作交给其他人。

Remember the principles of specificity and sensitivity. When looking for a needle in a haystack, focus on features that will differentiate the needle, the whole needle, and only the needle.
记住特异性和敏感性的原则。大海捞针时，请关注能够区分针头、整根针和仅区分针的特征。

Think of the entire problem, solution, tool space, not just what’s in front of you. Use the right tool for the job (e.g. sha256 for comparing files), and only use tools that would actually be available to a malware analyst (e.g. definitely not private resources of third parties).
想想整个问题、解决方案、工具空间，而不仅仅是你面前的东西。为工作使用正确的工具（例如，用于比较文件的 sha256），并且仅使用恶意软件分析师实际可用的工具（例如，绝对不是第三方的私有资源）。

Provide exactly one (1) ‘next analyst action’ as a specific and detailed instruction that must be immediately actionable. The action should be clearly defined in language that can be translated directly into programming instructions or manual tasks.
只提供一（1）个“下一步分析师行动”作为必须立即可操作的具体而详细的指示。动作应该用可以直接翻译成编程指令或手动任务的语言明确定义。

Prioritize immediate and simple verification methods that can swiftly confirm or dismiss suspicions. Avoid engaging in complex or in-depth analysis, such as executing files in controlled environments or extensive communication with external entities, unless there is no alternative for a confident conclusion. Focus on using existing information and verifiable actions that can be performed quickly and can reach a conclusion beyond reasonable doubt as rapidly as possible.
优先考虑可以迅速确认或消除怀疑的即时和简单的验证方法。避免进行复杂或深入的分析，例如在受控环境中执行文件或与外部实体进行广泛通信，除非没有其他选择可以得出可靠的结论。专注于使用可以快速执行的现有信息和可验证的行动，并可以尽快得出排除合理怀疑的结论。

You are computing a function f on an input X. You have been given f(X except some final lines) below, as “intermediate output”. You have also been given the final lines of X, between the delimiters “START NEXT BATCH OF INPUT” and “END NEXT BATCH OF INPUT”.
您正在计算输入 X 上的函数 f。下面给你的 f（X 除了最后几行）作为“中间输出”。您还获得了 X 的最后一行，位于分隔符“START NEXT BATCH OF INPUT”和“END NEXT BATCH OF INPUT”之间。

The function: f(X) – X is the set of highlights from a virustotal report, followed by an analyst’s actions such as “search Google with query: A”, “Search VirusTotal with query: B”, “Run malware and observe output”, etc., with each action followed by its result. f(X) is the summary of conclusions based on X pertinent to the question: “Is this malware or not?”, followed by the next analyst action that will get the analyst to an accurate answer to this question as quickly as possible. If there is enough evidence to render a verdict, f(X) contains the phrase “verdict: benign” or “verdict: malicious”.
函数：f（X） – X 是 virustotal 报告中的一组突出显示，然后是分析师的操作，例如“使用查询搜索 Google：A”、“使用查询搜索 VirusTotal：B”、“运行恶意软件并观察输出”等，每个操作后跟其结果。f（X）是基于 X 的结论摘要，与以下问题相关：“这是不是恶意软件？”，然后是下一个分析师行动，这将使分析师尽快准确地回答这个问题。如果有足够的证据做出判决，则 f（X）包含短语“判决：良性”或“判决：恶意”。

Intermediate output: None so far
中间输出：到目前为止没有

END WORKING MEMORY 结束工作记忆

START NEXT BATCH OF INPUT
开始下一批输入

File name: APSDaemon.exe Vendor analysis: 2 malicious verdicts/71 vendors
文件名： APSDaemon.exe 供应商分析：2 恶意判定/71 供应商

Creation Time: 2017-11-07 05:03:34 UTC Signature Date: 2017-11-07 05:05:00 UTC First Seen In The Wild: 2016-12-07 14:03:45 UTC First Submission: 2017-12-06 23:01:28 UTC Last Submission: 2023-07-10 22:02:11 UTC Last Analysis: 2023-07-28 06:43:18 UTC
创建日期： 2017-11-07 05：03：34 UTC 签名日期： 2017-11-07 05：05：00 UTC 首次出现在野外： 2016-12-07 14：03：45 UTC 首次提交： 2017-12-06 23：01：28 UTC 最后提交日期： 2023-07-10 22：02：11 UTC 最后分析： 2023-07-28 06：43：18 UTC

Capabilities and indicators: Affect system registries The file has content beyond the declared end of file. The file has authenticode/codesign signature information. Signed file, valid signature
功能和指示器：影响系统注册表文件的内容超出了声明的文件末尾。该文件具有验证码/协同设计签名信息。签名文件，有效签名

File version information Copyright © 2017 Apple Inc. All rights reserved. Product: Apple Push Description: Apple Push Original Name: APSDaemon.exe File Version 2.7.22.21 Date signed: 2017-11-07 03:05:00 UTC
文件版本信息版权所有 © 2017 Apple Inc.保留所有权利。产品：苹果推送描述：苹果推送原始名称： APSDaemon.exe 文件版本 2.7.22.21 签名日期： 2017-11-07 03：05：00 UTC

Top level signature: Name: Apple Inc. Status: This certificate or one of the certificates in the certificate chain is not time valid. Issuer: Symantec Class 3 SHA256 Code Signing CA Valid From: 12:00 AM 02/25/2016 Valid To: 11:59 PM 02/24/2018 Valid Usage: Code Signing Algorithm: sha256RSA Thumbprint: EF74C7E726EE9BE45BD2B23544F9CFDE61000C8A Serial Number: 0E BC 19 35 D5 29 4A 59 4B 4F 32 70 7B 0A 0A B9
顶级签名：名称： Apple Inc. 状态：此证书或证书链中的某个证书没有时间有效。颁发者：Symantec Class 3 SHA256 代码签名 CA 有效时间： 12：00 AM 02/25/2016 有效期至： 11：59 PM 02/24/2018 有效用法：代码签名算法：sha256RSA 指纹：EF74C7E726EE9BE45BD2B23544F9CFDE61000C8A 序列号： 0E BC 19 35 D5 29 4A 59 4B 4F 32 70 7B 0A 0A B9

DNS queries: 154.21.82.20.in-addr.arpa 82.250.63.168.in-addr.arpa crl.thawte.com fp2e7a.wpc.2be4.phicdn.net ocsp.thawte.com prda.aadg.msidentity.com
DNS查询：154.21.82.20.in-addr.arpa 82.250.63.168.in-addr.arpa crl.thawte.com fp2e7a.wpc.2be4.phicdn.net ocsp.thawte.com prda.aadg.msidentity.com

Files written: C:984.tmp C:984.tmp.csv C:9C.tmp C:9C.tmp.txt C:_v4.0_32.exe.log C:_v4.0_32.log C:.log C:.log C:.out C:
写入文件： C：984.tmp C：984.tmp.csv C：9C.tmp C：9C.tmp.txt C：_v4.0_32.exe.log C：_v4.0_32.log C：.log C：.log C：.out C：

Files deleted: %USERPROFILE%1I0ZU[1].xml C:1056.tmp.WERInternalMetadata.xml C:1068.tmp.csv C:1079.tmp.txt C:10B4.tmp.WERInternalMetadata.xml C:1160.tmp.csv C:119F.tmp.txt C:122B.tmp.WERInternalMetadata.xml C:12E7.tmp.csv C:1316.tmp.txt
已删除的文件：%USERPROFILE%1I0ZU[1].xml C：1056.tmp。WERInternalMetadata.xml C：1068.tmp.csv C：1079.tmp.txt C：10B4.tmp.WERInternalMetadata.xml C：1160.tmp.csv C：119F.tmp.txt C：122B.tmp.WERInternalMetadata.xml C：12E7.tmp.csv C：1316.tmp.txt

Registry keys set: HKEY_LOCAL_MACHINESettings*-1-5-21-1015118539-3749460369-599379286-1001{A79EEDB6-96F6-4E65-BDCB-3A66617000FA} HKEY_LOCAL_MACHINESettings*-1-5-21-1015118539-3749460369-599379286-1001{A79EEDB6-96F6-4E65-BDCB-3A66617000FA} HKEY_LOCAL_MACHINESettings*-1-5-21-1015118539-3749460369-599379286-1001{A79EEDB6-96F6-4E65-BDCB-3A66617000FA} HKEY_LOCAL_MACHINESettings*-1-5-21-1015118539-3749460369-599379286-1001{A79EEDB6-96F6-4E65-BDCB-3A66617000FA} HKEY_LOCAL_MACHINE6432NodeInc.Application Support HKEY_LOCAL_MACHINE3a19d3d1 HKEY_LOCAL_MACHINE3a19d3d1 HKEY_LOCAL_MACHINE3a19d3d1 HKEY_LOCAL_MACHINE3a19d3d1 HKEY_LOCAL_MACHINE79b42b
注册表项集：HKEY_LOCAL_MACHINESettings*-1-5-21-1015118539-3749460369-599379286-1001{A79EEDB6-96F6-4E65-BDCB-3A66617000FA} HKEY_LOCAL_MACHINESettings*-1-5-21-1015118539-3749460369-599379286-1001{A79EEDB6-96F6-4E655 -BDCB-3A66617000FA} HKEY_LOCAL_MACHINESettings*-1-5-21-1015118539-3749460369-599379286-1001{A79EEDB6-96F6-4E65-BDCB-3A66617000FA} HKEY_LOCAL_MACHINESettings*-1-5-21-1015118539-3749460369-599379286-1001{A79EEDB6-96F6-4E65-BDCB-3A66617000FA} HKEY_LOCAL_MACHINE6432NodeInc.应用支持HKEY_LOCAL_MACHINE3a19d3d1 HKEY_LOCAL_MACHINE3a19d3d1 HKEY_LOCAL_MACHINE3a19d3d1 HKEY_LOCAL_MACHINE3a19d3d1 HKEY_LOCAL_MACHINE79b42b

Processes created: %SAMPLEPATH%.exe C:.exe C:45eb5c9d3f89cb059212e00512ec0e6c47c1bdf12842256ceda5d4f1371bd5.exe
创建的进程： %SAMPLEPATH%.exe C：.exe C：45eb5c9d3f89cb059212e00512ec0e6c47c1bdf12842256ceda5d4f1371bd5.exe

Shell commands: %SAMPLEPATH%.exe C:.exe C:45eb5c9d3f89cb059212e00512ec0e6c47c1bdf12842256ceda5d4f1371bd5.exe
Shell 命令： %SAMPLEPATH%.exe C：.exe C：45eb5c9d3f89cb059212e00512ec0e6c47c1bdf12842256ceda5d4f1371bd5.exe

Mutexes Created: :1728:120:WilError_01 :1728:304:WilStaging_02 :2444:120:WilError_01 :2444:304:WilStaging_02 :4312:120:WilError_01 :4312:304:WilStaging_02 :4580:304:WilStaging_02 :5168:120:WilError_01 :5168:304:WilStaging_02
互斥锁创建：：1728：120：WilError_01 ：1728：304：WilStaging_02 ：2444：120：WilError_01 ：2444：304：WilStaging_02 ：4312：120：WilError_01 ：4312：304：WilStaging_02 ：4580：304：WilStaging_02 ：5168：120：WilError_01 ：5168：304：WilStaging_02

Strings: YSLoader ignoring invalid key/value pair %S YSLoader ignoring unknown/unsupported log flag: %S YSLoader ignoring unknown/unsupported Announce action: %S YSLoader ignoring unknown/unsupported key/value pair %S Win32 error %u attempting to count UTF-16 characters based on UTF-8 Win32 error %u attempting to convert UTF-8 string to UTF-16 Win32 error %u attempting to count UTF-8 characters based on UTF-16 Win32 error %u attempting to convert UTF-16 string to UTF-8 YSLoader checking for parameters in environment variable “%S” Win32 error %u attempting to find parent process catastrophic error in YSLoader WinMain: GetCommandLineW failed EXCEPTION_FLT_DIVIDE_BY_ZERO: The thread tried to divide a floating-point value by a floating-point divisor of zero. EXCEPTION_FLT_INEXACT_RESULT: The result of a floating-point operation cannot be represented exactly as a decimal fraction. EXCEPTION_FLT_INVALID_OPERATION: This exception represents any floating-point exception not included in this list. EXCEPTION_FLT_OVERFLOW: The exponent of a floating-point operation is greater than the magnitude allowed by the corresponding type. EXCEPTION_FLT_STACK_CHECK: The stack overflowed or underflowed as the result of a floating-point operation. EXCEPTION_FLT_UNDERFLOW: The exponent of a floating-point operation is less than the magnitude allowed by the corresponding type. EXCEPTION_ILLEGAL_INSTRUCTION: The thread tried to execute an invalid instruction. EXCEPTION_IN_PAGE_ERROR: The thread tried to access a page that was not present, and the system was unable to load the page. For example, this exception might occur if a network connection is lost while running a program over the network. EXCEPTION_INT_DIVIDE_BY_ZERO: The thread tried to divide an integer value by an integer divisor of zero. EXCEPTION_INT_OVERFLOW: The result of an integer operation caused a carry out of the most significant bit of the result. EXCEPTION_INVALID_DISPOSITION: An exception handler returned an invalid disposition to the exception dispatcher. Programmers using a high-level language such as C should never encounter this exception. EXCEPTION_NONCONTINUABLE_EXCEPTION: The thread tried to continue execution after a noncontinuable exception occurred. EXCEPTION_PRIV_INSTRUCTION: The thread tried to execute an instruction whose operation is not allowed in the current machine mode. EXCEPTION_SINGLE_STEP: A trace trap or other single-instruction mechanism signaled that one instruction has been executed. unknown structured exception 0x%08lu ADVAPI32.DLL @YSCrashDump VS_VERSION_INFO StringFileInfo 00000000 CompanyName Apple Inc. FileDescription Apple Push FileVersion 2.7.22.21 LegalCopyright 2017 Apple Inc. All rights reserved. OriginalFilename APSDaemon.exe ProductName VarFileInfo Translation
字符串： YSLoader 忽略无效的键/值对 %S YSLoader 忽略未知/不支持的日志标志： %S YSLoader 忽略未知/不支持宣布操作： %S YSLoader 忽略未知/不支持的键/值对 %S Win32 错误 %u 尝试根据 UTF-8 对 UTF-16 字符进行计数 Win32 错误 %u 尝试将 UTF-8 字符串转换为 UTF-16 Win32 错误 %u 尝试根据 UTF-16 对 UTF-8 字符进行计数 Win32 错误 %u 尝试将 UTF-16 字符串转换为 UTF-8 YSLoader 检查环境变量“%S”中的参数 Win32 错误 %u 尝试在 YSLoader WinMain 中查找父进程灾难性错误： GetCommandLineW 失败 EXCEPTION_FLT_DIVIDE_BY_ZERO：线程尝试将浮点值除以零的浮点除数。EXCEPTION_FLT_INEXACT_RESULT：浮点运算的结果不能完全表示为小数。EXCEPTION_FLT_INVALID_OPERATION：此异常表示此列表中未包含的任何浮点异常。EXCEPTION_FLT_OVERFLOW：浮点运算的指数大于相应类型允许的幅度。EXCEPTION_FLT_STACK_CHECK：由于浮点运算，堆栈溢出或下溢。EXCEPTION_FLT_UNDERFLOW：浮点运算的指数小于相应类型允许的幅度。EXCEPTION_ILLEGAL_INSTRUCTION：线程尝试执行无效指令。EXCEPTION_IN_PAGE_ERROR：线程尝试访问不存在的页面，但系统无法加载该页面。例如，如果在通过网络运行程序时网络连接丢失，则可能会发生此异常。EXCEPTION_INT_DIVIDE_BY_ZERO：线程尝试将整数值除以零的整数除数。 EXCEPTION_INT_OVERFLOW：整数运算的结果导致了结果中最高有效位的执行。EXCEPTION_INVALID_DISPOSITION：异常处理程序向异常调度程序返回了无效的处置。使用高级语言（如 C）的程序员不应遇到此异常。EXCEPTION_NONCONTINUABLE_EXCEPTION：线程在发生不可连续的异常后尝试继续执行。EXCEPTION_PRIV_INSTRUCTION：线程尝试执行在当前计算机模式下不允许操作的指令。EXCEPTION_SINGLE_STEP：跟踪陷阱或其他单指令机制发出信号，表明一条指令已执行。未知结构化异常 0x%08lu ADVAPI32.DLL @YSCrashDump VS_VERSION_INFO StringFileInfo 000000000 公司名称 Apple Inc. 文件说明 Apple Push FileVersion 2.7.22.21 法律Copyright 2017 Apple Inc.保留所有权利。OriginalFilename APSDaemon.exe ProductName VarFileInfo 翻译

END NEXT BATCH OF INPUT
结束下一批输入

Please output, with no introductions, addendums or ceremony, the next state of the working memory after processing the above lines of input.
请在处理上述输入行后输出工作记忆的下一个状态，不带介绍、附录或仪式。

原文始发于cp<r>：GPT VS MALWARE ANALYSIS: CHALLENGES AND MITIGATIONS

版权声明：admin 发表于 2023年11月17日上午8:58。
转载请注明：GPT VS MALWARE ANALYSIS: CHALLENGES AND MITIGATIONS | CTF导航

ida脚本开发环境配置idapython&idacpp三端环境(win、mac、linux)

admin

另一种基于AVX2/SSE2的高效模式匹配算法在内存搜索中的应用

admin

【逆向系列】7-某加速验证码识别-密码加密逆向

admin

418

恶意程序的快速分析

admin

736

Off-By-One Con2024 N-day fullchain 利用

admin

Known Indicators of Compromise Associated with Androxgh0st Malware

admin

暂无评论

您必须登录才能参与评论！

立即登录

暂无评论...

GPT VS MALWARE ANALYSIS: CHALLENGES AND MITIGATIONS

Key Takeaways 关键要点

Introduction 介绍

The Triage Task 会审任务

GPT’s Natural Strengths GPT 的天然优势

Sentence Completion 句子完成

Big Picture Summaries 大图摘要

Logical Reasoning 逻辑推理

Challenges in Practical GPT Application
GPT实际应用的挑战

Memory Window Drift 内存窗口漂移

Gap Between Knowledge and Action
知识与行动之间的差距

Mitigation 缓解

Logical Reasoning Ceiling
逻辑推理上限

Mitigation 缓解

Detachment from Expertise
脱离专业知识

Mitigation 缓解

Goal Orientation Gap 目标取向差距

Goal Orientation Gap 1: Resource Management
目标导向差距 1：资源管理

Goal Orientation Gap 2: Agency
目标取向差距2：代理

Goal Orientation Gap 3: Politicking
目标导向差距3：政治

Goal Orientation: Conclusion & Mitigation
目标导向：结论与缓解

Spatial Blindness 空间盲症

Mitigation 缓解

Full Proof of Concept & Conclusion
完整的概念验证和结论

Addendum 1: Fully Engineered Prompt Used in Triage Task
附录 1：在会审任务中使用的完全工程化的提示

Don’t throw a hissy fit; defend against Medusa

【恶意文件】钓鱼邮件通过脚本传播远控木马

相关文章

暂无评论

相关文章

GPT VS MALWARE ANALYSIS: CHALLENGES AND MITIGATIONS

Key Takeaways 关键要点

Introduction 介绍

The Triage Task 会审任务

GPT’s Natural Strengths GPT 的天然优势

Sentence Completion 句子完成

Big Picture Summaries 大图摘要

Logical Reasoning 逻辑推理

Challenges in Practical GPT Application GPT实际应用的挑战

Memory Window Drift 内存窗口漂移

Gap Between Knowledge and Action 知识与行动之间的差距

Mitigation 缓解

Logical Reasoning Ceiling 逻辑推理上限

Mitigation 缓解

Detachment from Expertise 脱离专业知识

Mitigation 缓解

Goal Orientation Gap 目标取向差距

Goal Orientation Gap 1: Resource Management 目标导向差距 1：资源管理

Goal Orientation Gap 2: Agency 目标取向差距2：代理

Goal Orientation Gap 3: Politicking 目标导向差距3：政治

Goal Orientation: Conclusion & Mitigation 目标导向：结论与缓解

Spatial Blindness 空间盲症

Mitigation 缓解

Full Proof of Concept & Conclusion 完整的概念验证和结论

Addendum 1: Fully Engineered Prompt Used in Triage Task 附录 1：在会审任务中使用的完全工程化的提示

Don’t throw a hissy fit; defend against Medusa

【恶意文件】钓鱼邮件通过脚本传播远控木马

相关文章

暂无评论

广告位

相关文章

Challenges in Practical GPT Application
GPT实际应用的挑战

Gap Between Knowledge and Action
知识与行动之间的差距

Logical Reasoning Ceiling
逻辑推理上限

Detachment from Expertise
脱离专业知识

Goal Orientation Gap 1: Resource Management
目标导向差距 1：资源管理

Goal Orientation Gap 2: Agency
目标取向差距2：代理

Goal Orientation Gap 3: Politicking
目标导向差距3：政治

Goal Orientation: Conclusion & Mitigation
目标导向：结论与缓解

Full Proof of Concept & Conclusion
完整的概念验证和结论

Addendum 1: Fully Engineered Prompt Used in Triage Task
附录 1：在会审任务中使用的完全工程化的提示