How LLMs Work, Explained Without Math

I’m sure you agree that it has become impossible to ignore Generative AI (GenAI), as we are constantly bombarded with mainstream news about Large Language Models (LLMs). Very likely you have tried ChatGPT, maybe even keep it open all the time as an assistant.

A basic question I think a lot of people have about the GenAI revolution is where the apparent intelligence these models display comes from. In this article, I’m going to attempt to explain in simple terms and without using advanced math how generative text models work, to help you think about them as computer algorithms and not as magic.

What Does An LLM Do?

I’ll begin by clearing a big misunderstanding people have regarding how Large Language Models work. The assumption that most people make is that these models can answer questions or chat with you, but in reality all they can do is take some text you provide as input and guess what the next word (or more accurately, the next token) is going to be.

Let’s start to unravel the mystery of LLMs from the tokens.

Tokens

A token is the basic unit of text understood by the LLM. It is convenient to think of tokens as words, but for the LLM the goal is to encode text as efficiently as possible, so in many cases tokens represent sequences of characters that are shorter or longer than whole words. Punctuation symbols and spaces are also represented as tokens, either individually or grouped with other characters.

The complete list of tokens used by an LLM is said to be the LLM’s vocabulary, since it can be used to express any possible text. The byte pair encoding (BPE) algorithm is commonly used by LLMs to generate a token vocabulary given an input dataset. Just so that you have some rough idea of scale, the GPT-2 language model, which is open source and can be studied in detail, uses a vocabulary of 50,257 tokens.

Each token in an LLM’s vocabulary is given a unique identifier, usually a number. The LLM uses a tokenizer to convert between regular text given as a string and an equivalent sequence of tokens, given as a list of token numbers. If you are familiar with Python and want to play with tokens, you can install the tiktoken package from OpenAI:

$ pip install tiktoken

Then try this in a Python prompt:

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-2")

>>> encoding.encode("The quick brown fox jumps over the lazy dog.")
[464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13]

>>> encoding.decode([464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13])
'The quick brown fox jumps over the lazy dog.'

>>> encoding.decode([464])
'The'
>>> encoding.decode([2068])
' quick'
>>> encoding.decode([13])
'.'

You can see in this experiment that for the GPT-2 language model token 464 represents the word “The”, and token 2068 represents the word “ quick”, including a leading space. This model uses token 13 for the period.

Because tokens are determined algorithmically, you may find strange things, such as these three variants of the word “the”, all encoded as different tokens by GPT-2:

>>> encoding.encode('The')
[464]
>>> encoding.encode('the')
[1169]
>>> encoding.encode(' the')
[262]

The BPE algorithm doesn’t always map entire words to tokens. In fact, words that are less frequently used do not get to be their own token and have to be encoded with multiple tokens. Here is an example of a word that this model encodes with two tokens:

>>> encoding.encode("Payment")
[19197, 434]

>>> encoding.decode([19197])
'Pay'
>>> encoding.decode([434])
'ment'

Next Token Predictions

As I stated above, given some text, a language model makes predictions about what token will follow right after. If it helps to see this with Python pseudo-code, here is how you could run one of these models to get predictions for the next token:

predictions = get_token_predictions(['The', ' quick', ' brown', ' fox'])

The function gets a list of input tokens, which are encoded from the prompt provided by the user. In this example I’m assuming words are all individual tokens. To keep things simple I’m using the textual representation of each token, but as you’ve seen before in reality each token will be passed to the model as a number.

The returned value of this function is a data structure that assigns each token in the vocabulary a probability to follow the input text. If this was based on GPT-2, the return value of the function would be a list of 50,257 floating point numbers, each predicting a probability that the corresponding token will come next.

In the example above you could imagine that a well trained language model will give the token “jumps” a high probability to follow the partial phrase “The quick brown fox” that I used as prompt. Once again assuming a model trained appropriately, you could also imagine that the probability of a random word such as “potato” continuing this phrase is going to be much lower and close to 0.
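
To make this more concrete, here is a rough sketch of what that returned data structure might contain for this prompt. The token strings and probability values below are invented for illustration, and the structure is shown as a Python dictionary keyed by the text of each token just for readability; as described above, a real GPT-2 based model would return a plain list of 50,257 probabilities, one per vocabulary token.

# Hypothetical predictions for the prompt "The quick brown fox".
# Only a few entries are shown, with made-up numbers.
predictions = {
    ' jumps': 0.82,      # a very likely continuation
    ' runs': 0.07,
    ' leaps': 0.05,
    ' potato': 0.00001,  # an extremely unlikely continuation
    # ...one entry for every remaining token in the vocabulary
}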

To be able to produce reasonable predictions, the language model has to go through a training process. During training, it is presented with lots and lots of text to learn from. At the end of the training, the model is able to calculate next token probabilities for a given token sequence using data structures that it has built using all the text that it saw in training.

Is this different from what you expected? I hope this is starting to look less magical now.

Generating Long Text Sequences

Since the model can only predict what the next token is going to be, the only way to make it generate complete sentences is to run the model multiple times in a loop. With each loop iteration a new token is generated, chosen from the returned probabilities. This token is then added to the input that is given to the model on the next iteration of the loop, and this continues until sufficient text has been generated.

Let’s look at a more complete Python pseudo-code showing how this would work:

def generate_text(prompt, num_tokens, hyperparameters):
    tokens = tokenize(prompt)
    for i in range(num_tokens):
        predictions = get_token_predictions(tokens)
        next_token = select_next_token(predictions, hyperparameters)
        tokens.append(next_token)
    return ''.join(tokens)

The generate_text() function takes a user prompt as an argument. This could be, for example, a question.

The tokenize() helper function converts the prompt to an equivalent list of tokens, using tiktoken or a similar library. Inside the for-loop, the get_token_predictions() function is where the AI model is called to get the probabilities for the next token, as in the previous example.

The job of the select_next_token() function is to take the next token probabilities (or predictions) and pick the best token to continue the input sequence. The function could just pick the token with the highest probability, which in machine learning is called a greedy selection. Better yet, it can pick a token using a random number generator that honors the probabilities returned by the model, and in that way add some variety to the generated text. This will also make the model produce different responses if given the same prompt multiple times.
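
As an illustration, here is one way a select_next_token() function along these lines could be sketched, using random.choices() from the Python standard library to do the probability-weighted pick. This is only a sketch that assumes the dictionary-style predictions shown earlier; the hyperparameters argument is ignored for now, since it is the subject of the next paragraph.

import random

def select_next_token(predictions, hyperparameters=None):
    # predictions maps each candidate token to its probability. A greedy
    # selection would simply be: max(predictions, key=predictions.get)
    # Instead, sample at random while honoring the model's probabilities,
    # so running the model twice can produce different continuations.
    tokens = list(predictions.keys())
    weights = list(predictions.values())
    return random.choices(tokens, weights=weights, k=1)[0]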

To make the token selection process even more flexible, the probabilities returned by the LLM can be modified using hyperparameters, which are passed to the text generation function as arguments. The hyperparameters allow you to control the “greediness” of the token selection process. If you have used LLMs, you are likely familiar with the temperature hyperparameter. With a higher temperature, the token probabilities are flattened out, which increases the chances that less likely tokens are selected, with the end result of making the generated text look more creative or unusual. You may have also used two other hyperparameters called top_p and top_k, which control how many of the most probable tokens are considered for selection.
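
To give a feel for what these hyperparameters do, here is a hedged sketch of a helper (adjust_probabilities() is a made-up name) that approximates the effect of temperature and top_k on the predictions before sampling. Real implementations apply these adjustments to the model’s raw scores rather than to finished probabilities, and top_p works differently, keeping the smallest group of top tokens whose probabilities add up to the requested value.

def adjust_probabilities(predictions, temperature=1.0, top_k=None):
    # Raising each probability to the power 1/temperature flattens the
    # distribution when temperature > 1 (more unusual picks) and sharpens
    # it when temperature < 1 (closer to greedy selection).
    weights = {t: p ** (1.0 / temperature) for t, p in predictions.items()}

    # top_k: keep only the k most likely tokens and drop all the others.
    if top_k is not None:
        kept = sorted(weights, key=weights.get, reverse=True)[:top_k]
        weights = {t: weights[t] for t in kept}

    # Re-normalize so the adjusted values add up to 1 again.
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}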

Once a token has been selected, the loop iterates and now the model is given an input that includes the new token at the end, and one more token is generated to follow it. The num_tokens argument controls how many iterations to run the loop for, or in other words, how much text to generate. The generated text can (and often does) end mid-sentence, because the LLM has no concept of sentences or paragraphs, since it just works on one token at a time. To prevent the generated text from ending in the middle of a sentence, we could consider the num_tokens argument as a maximum instead of an exact number of tokens to generate, and in that case we could stop the loop when a period token is generated.
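
As a sketch, the generate_text() pseudo-code from above could be adjusted along these lines, treating num_tokens as an upper limit and stopping as soon as a period token comes out:

def generate_text(prompt, num_tokens, hyperparameters):
    tokens = tokenize(prompt)
    for i in range(num_tokens):  # num_tokens now acts as a maximum, not an exact count
        predictions = get_token_predictions(tokens)
        next_token = select_next_token(predictions, hyperparameters)
        tokens.append(next_token)
        if next_token == '.':    # stop early once a sentence-ending period is generated
            break
    return ''.join(tokens)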

If you’ve reached this point and understood everything then congratulations, you now know how LLMs work at a high level. Are you interested in more details? In the next section I’ll get a bit more technical, while still doing my best to avoid referencing the math that supports this technology, which is quite advanced.

Model Training

Unfortunately, discussing how a model is trained is actually difficult without using math. What I’m going to do is start by showing you a very simple training approach.

Given that the task is to predict tokens that follow other tokens, a simple way to train a model is to get all the pairs of consecutive tokens that appear in the training dataset and build a table of probabilities with them.

Let’s do this with a short vocabulary and dataset. Let’s say the model’s vocabulary has the following five tokens:

['I', 'you', 'like', 'apples', 'bananas']

To keep this example short and simple, I’m not going to consider spaces or punctuation symbols as tokens.

Let’s use a training dataset that is composed of three sentences:

  • I like apples
  • I like bananas
  • you like bananas

We can build a 5×5 table and in each cell write how many times the token representing the row of the cell is followed by the token representing the column. Here is the table built from the three sentences in the dataset:

          I         you       like      apples    bananas
I                             2
you                           1
like                                    1         2
apples
bananas

Hopefully this is clear. The dataset has two instances of “I like”, one instance of “you like”, one instance of “like apples” and two of “like bananas”.

Now that we know how many times each pair of tokens appeared in the training dataset, we can calculate the probabilities of each token following each other. To do this, we convert the numbers in each row to probabilities. For example, token “like” in the middle row of the table was followed once by “apples” and twice by “bananas”. That means that “apples” follows “like” 33.3% of the time, and “bananas” follows it the remaining 66.7%.

Here is the complete table with all the probabilities calculated. Empty cells have a probability of 0%.

          I         you       like      apples    bananas
I                             100%
you                           100%
like                                    33.3%     66.7%
apples    25%       25%       25%                 25%
bananas   25%       25%       25%       25%

The rows for “I”, “you” and “like” are easy to calculate, but “apples” and “bananas” present a problem because they have no data at all, since the dataset does not have any examples with these tokens being followed by other tokens. Here we have a “hole” in our training, so to make sure that the model produces a prediction even when lacking training, I have decided to split the probabilities for a follow-up token for “apples” and “bananas” evenly across the other four possible tokens. This could obviously generate strange results, but at least the model will not get stuck when it reaches one of these two tokens.
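
If it helps to see this in code, here is a small Python sketch that builds the counts table from the three training sentences, converts each row of counts into probabilities, and applies the even split just described to the rows that have no data:

vocabulary = ['I', 'you', 'like', 'apples', 'bananas']
training_dataset = [
    ['I', 'like', 'apples'],
    ['I', 'like', 'bananas'],
    ['you', 'like', 'bananas'],
]

# Count how many times each token is followed by each other token.
counts = {token: {t: 0 for t in vocabulary} for token in vocabulary}
for sentence in training_dataset:
    for current, following in zip(sentence, sentence[1:]):
        counts[current][following] += 1

# Convert each row of counts into probabilities. Rows with no data at all
# (the "holes" discussed above) get an even split across the other tokens.
probabilities_table = {}
for token, row in counts.items():
    total = sum(row.values())
    if total > 0:
        probabilities_table[token] = {t: count / total for t, count in row.items()}
    else:
        others = [t for t in vocabulary if t != token]
        probabilities_table[token] = {t: 1 / len(others) for t in others}

print(probabilities_table['like'])
# {'I': 0.0, 'you': 0.0, 'like': 0.0, 'apples': 0.333..., 'bananas': 0.666...}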

The problem of holes in training data is actually important. In real LLMs the training datasets are very large, so you would not find training holes as obvious as the one in my tiny example above. But smaller holes, which are harder to detect because they come from low coverage in the training data, do exist and are fairly common. The quality of the token predictions the LLM makes in these poorly trained areas can be bad, but often in ways that are difficult to perceive. This is one of the reasons LLMs can sometimes hallucinate, which happens when the generated text reads well, but contains factual errors or inconsistencies.

Using the probabilities table above, you may now imagine how an implementation of the get_token_predictions() function would work. In Python pseudo-code it would be something like this:

def get_token_predictions(input_tokens):
    last_token = input_tokens[-1]
    return probabilities_table[last_token]

Simpler than expected, right? The function accepts a sequence of tokens, which come from the user prompt. It takes the last token in the sequence, and returns the row in the probabilities table that corresponds to that token.

If you were to call this function with ['you', 'like'] as input tokens, for example, the function would return the row for “like”, which gives the token “apples” a 33.3% chance of continuing the sentence, and the token “bananas” the other 66.7%. With these probabilities, the select_next_token() function shown above should choose “apples” one out of three times.
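
Putting the earlier sketches together (the probabilities_table built in the previous code example and the sampling version of select_next_token()), a quick experiment shows roughly that split:

from collections import Counter

# Sample the continuation of "you like" many times and count the results.
results = Counter(
    select_next_token(get_token_predictions(['you', 'like']))
    for _ in range(10_000)
)
print(results)  # approximately Counter({'bananas': 6660, 'apples': 3340})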

When the “apples” token is selected as a continuation of “you like”, the sentence “you like apples” will be formed. This is an original sentence that did not exist in the training dataset, yet it is perfectly reasonable. Hopefully you are starting to get an idea of how these models can come up with what appears to be original ideas or concepts, just by reusing patterns and stitching together different bits of what they learned in training.

The Context Window

The approach I took in the previous section to train my mini-language model is called a Markov chain.

An issue with this technique is that only one token (the last of the input) is used to make a prediction. Any text that appears before that last token doesn’t have any influence when choosing how to continue, so we can say that the context window of this solution is equal to one token, which is very small. With such a small context window the model constantly “forgets” its line of thought and jumps from one word to the next without much consistency.

To improve the model’s predictions a larger probabilities table can be constructed. To use a context window of two tokens, additional table rows would have to be added to represent all possible sequences of two tokens. With the five tokens I used in the example there would be 25 new rows in the probabilities table, one for each pair of tokens, added to the 5 single-token rows that are already there. The model would have to be trained again, this time looking at groups of three tokens in addition to the pairs. Then in each loop iteration of the get_token_predictions() function the last two tokens from the input would be used when available, to find the corresponding row in the larger probabilities table.
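
Here is a hedged sketch of what this could look like in code, keying the new rows by pairs of tokens. Only the pairs that actually appear in the three training sentences get a row here, and the counts are left unnormalized; turning them into probabilities works exactly as in the single-token case:

from collections import defaultdict

# Count, for every pair of consecutive tokens, which token follows them.
pair_counts = defaultdict(lambda: defaultdict(int))
for sentence in training_dataset:
    for first, second, following in zip(sentence, sentence[1:], sentence[2:]):
        pair_counts[(first, second)][following] += 1

# From the three training sentences this produces:
# {('I', 'like'): {'apples': 1, 'bananas': 1}, ('you', 'like'): {'bananas': 1}}

def get_token_predictions(input_tokens):
    # Use the last two tokens of the input when we have data for them,
    # otherwise fall back to the single-token table from before.
    last_two = tuple(input_tokens[-2:])
    if len(input_tokens) >= 2 and last_two in pair_counts:
        return pair_counts[last_two]
    return probabilities_table[input_tokens[-1]]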

But a context window of 2 tokens is still insufficient. For the generated text to be consistent with itself and make at least some basic sense, a much larger context window is needed. Without a large enough context it is impossible for newly generated tokens to relate to concepts or ideas expressed in previous tokens. So what can we do? Increasing the context window to 3 tokens would add 125 additional rows to the probabilities table, and the quality would still be very poor. How large do we need to make the context window?

The open source GPT-2 model from OpenAI uses a context window of 1024 tokens. To be able to implement a context window of this size using Markov chains, each row of the probabilities table would have to represent a sequence that is between 1 and 1024 tokens long. Using the above example vocabulary of 5 tokens, there are 5^1024 possible sequences that are 1024 tokens long. How many table rows are required to represent this? I did the calculation in a Python session (scroll to the right to see the complete number):

>>> pow(5, 1024)
55626846462680034577255817933310101605480399511558295763833185422180110870347954896357078975312775514101683493275895275128810854038836502721400309634442970528269449838300058261990253686064590901798039126173562593355209381270166265416453973718012279499214790991212515897719252957621869994522193843748736289511290126272884996414561770466127838448395124802899527144151299810833802858809753719892490239782222290074816037776586657834841586939662825734294051183140794537141608771803070715941051121170285190347786926570042246331102750604036185540464179153763503857127117918822547579033069472418242684328083352174724579376695971173152319349449321466491373527284227385153411689217559966957882267024615430273115634918212890625

That is a lot of rows! And this is only a portion of the table, since we would also need sequences that are 1023 tokens long, 1022, etc., all the way to 1, since we want to make sure shorter sequences can also be handled when not enough tokens are available in the input. Markov chains are fun to work with, but they do have a big scalability problem.

And a context window of 1024 tokens isn’t even that great anymore. With GPT-3, the context window was increased to 2048 tokens, then increased to 4096 in GPT-3.5. GPT-4 started with 8192 tokens, later got increased to 32K, and then again to 128K (that’s right, 128,000 tokens!). Models with 1M or larger context windows are starting to appear now, allowing models to have much better consistency and recall when they make token predictions.

In conclusion, Markov chains allow us to think about the problem of text generation in the right way, but they have big issues that prevent us from considering them as a viable solution.

From Markov Chains to Neural Networks

Obviously we have to forget the idea of having a table of probabilities, since a table for a reasonable context window would require an impossibly large amount of RAM. What we can do is replace the table with a function that returns an approximation of what the token probabilities would be, generated algorithmically instead of stored as a big table. This is actually something that neural networks can do well.

A neural network is a special type of function that takes some inputs, performs some calculations on them, and returns an output. For a language model the inputs are the tokens that represent the prompt, and the output is the list of predicted probabilities for the next token.

I said neural networks are “special” functions. What makes them special is that in addition to the function logic, the calculations they perform on the inputs are controlled by a number of externally defined parameters. Initially, the parameters of the network are not known, and as a result, the function produces an output that is completely useless. The training process for the neural network consists of finding the parameters that make the function perform the best when evaluated on the data from the training dataset, with the assumption that if the function works well with the training data it will work comparably well with other data.
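
As a very rough illustration, here is a toy stand-in with the same interface as get_token_predictions(), still using the tiny five-token vocabulary from the Markov chain example. It is not a real neural network, but it shows the shape of the idea: instead of looking up a row in a table, the probabilities are computed from a set of parameters. With the placeholder values below (an “untrained” network) every token gets the same probability, which is exactly the useless starting point described above; training would consist of gradually adjusting these numbers:

import math

# One tiny weight per (context token, next token) combination. Real models
# have billions of parameters organized in many layers; these placeholder
# values stand in for an untrained network.
parameters = [[0.0 for _ in vocabulary] for _ in vocabulary]

def get_token_predictions(input_tokens):
    # Accumulate a score for each candidate next token, letting every token
    # in the context window contribute, then turn the scores into
    # probabilities that add up to 1 (this last step is called a softmax).
    scores = [0.0] * len(vocabulary)
    for token in input_tokens:
        row = parameters[vocabulary.index(token)]
        scores = [s + w for s, w in zip(scores, row)]
    exponentials = [math.exp(s) for s in scores]
    total = sum(exponentials)
    return {t: e / total for t, e in zip(vocabulary, exponentials)}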

During the training process, the parameters are iteratively adjusted in small increments using an algorithm called backpropagation, which is heavy on math, so I won’t discuss it in this article. With each adjustment, the predictions of the neural network are expected to become a tiny bit better. After an update to the parameters, the network is evaluated again against the training dataset, and the results inform the next round of adjustments. This process continues until the function makes good next token predictions on the training dataset.

To help you have an idea of the scale at which neural networks work, consider that the GPT-2 model has about 1.5 billion parameters, and GPT-3 increased the parameter count to 175 billion. GPT-4 is said to have about 1.76 trillion parameters. Training neural networks at this scale with current generation hardware takes a very long time, usually weeks or months.

What is interesting is that because there are so many parameters, all calculated through a lengthy iterative process without human assistance, it is difficult to understand how a model works. A trained LLM is like a black box that is extremely difficult to debug, because most of the “thinking” of the model is hidden in the parameters. Even those who trained it have trouble explaining its inner workings.

Layers, Transformers and Attention

You may be curious to know what mysterious calculations happen inside the neural network function that can, with the help of well tuned parameters, take a list of input tokens and somehow output reasonable probabilities for the token that follows.

A neural network is configured to perform a chain of operations, each called a layer. The first layer receives the inputs, and performs some type of transformation on them. The transformed inputs enter the next layer and are transformed once again. This continues until the data reaches the final layer and is transformed one last time, generating the output, or prediction.
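
In code form you can picture this chain as nested function calls, where the output of each layer feeds the next. The layer function below is a made-up placeholder, not a real architecture; it only illustrates the chaining:

def layer(values, layer_parameters):
    # A stand-in for one layer: combine each value with one parameter.
    # Real layers perform far more elaborate transformations.
    return [v * p for v, p in zip(values, layer_parameters)]

def neural_network(inputs, parameters_per_layer):
    values = inputs
    for layer_parameters in parameters_per_layer:
        values = layer(values, layer_parameters)  # each layer transforms the previous output
    return values  # the final layer's output is the prediction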

Machine learning experts come up with different types of layers that perform mathematical transformations on the input data, and they also figure out ways to organize and group layers so that they achieve a desired result. Some layers are general purpose, while others are designed to work on a specific type of input data, such as images or, in the case of LLMs, tokenized text.

The neural network architecture that is the most popular today for text generation in large language models is called the Transformer. LLMs that use this design are said to be GPTs, or Generative Pre-Trained Transformers.

The distinctive characteristic of transformer models is a layer calculation they perform called Attention, that allows them to derive relationships and patterns between tokens that are in the context window, which are then reflected in the resulting probabilities for the next token.

The Attention mechanism was initially used in language translators, as a way to find which tokens in an input sequence are the most important to extract its meaning. This mechanism gives modern translators the ability to “understand” a sentence at a basic level, by focusing on (or driving “attention” to) the important words or tokens.

Do LLMs Have Intelligence?

By now you may be starting to form an opinion on whether LLMs show some form of intelligence in the way they generate text.

I personally do not see LLMs as having an ability to reason or come up with original thoughts, but that does not mean to say they’re useless. Thanks to the clever calculations they perform on the tokens that are in the context window, LLMs are able to pick up on patterns that exist in the user prompt and match them to similar patterns learned during training. The text they generate is formed from bits and pieces of training data for the most part, but the way in which they stitch words (tokens, really) together is highly sophisticated, in many cases producing results that feel original and useful.

On the other hand, given the propensity of LLMs to hallucinate, I wouldn’t trust any workflow in which the LLM produces output that goes straight to end users without verification by a human.

Will the larger LLMs that are going to appear in the following months or years achieve anything that resembles true intelligence? I feel this isn’t going to happen with the GPT architecture due to its many limitations, but who knows, maybe with some future innovations we’ll get there.

The End

Thank you for staying with me until the end! I hope I have piqued your interest enough for you to decide to continue learning, and eventually face all that scary math that you cannot avoid if you want to understand every detail. In that case, I can’t recommend Andrej Karpathy’s Neural Networks: Zero to Hero video series enough.

Originally published by Miguel Grinberg: How LLMs Work, Explained Without Math
