Attention Is All You Need: Detailed Reading Notes with a Code Walkthrough



I previously implemented a simple Transformer from scratch in PyTorch, but never read the original paper especially closely. This time I go through the original text in detail alongside that earlier implementation, hoping to deepen my understanding of the model architecture. The notes keep the original passages next to my commentary so they can be revisited quickly later. If you spot any mistakes, please point them out.

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.

The dominant sequence transduction models are built on complex recurrent or convolutional neural networks with an encoder and a decoder; the best-performing ones also connect the encoder and decoder through an attention mechanism.

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

The Transformer is based solely on attention mechanisms and needs no recurrent or convolutional networks.

Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

Experiments on two machine translation tasks show higher translation quality, better parallelizability, and significantly less training time.

Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

The Transformer reaches single-model state of the art after training for 3.5 days on 8 GPUs, a small fraction of the training cost of the best models in the literature.

1 Introduction

Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.

Recurrent neural networks, in particular LSTMs and gated recurrent networks, had been firmly established as the state of the art for sequence modeling and transduction problems such as language modeling and machine translation. Since then, numerous efforts have continued to push the boundaries of recurrent language models and encoder-decoder architectures.

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

Recurrent models factor computation along the symbol positions of the input and output sequences. Aligning positions with steps in computation time, the network produces a sequence of hidden states, where the state at time $t$ is a function of the previous hidden state and the input at position $t$. This inherently sequential nature precludes parallelization within a training example, which becomes critical at longer sequence lengths, since memory constraints limit how many examples fit in a batch.

Recent work has achieved significant gains in computational efficiency through factorization tricks and conditional computation, with the latter also improving model quality, but the fundamental constraint of sequential computation remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models, allowing dependencies to be modeled regardless of their distance in the input or output sequences. In most cases, however, attention is used together with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

The Transformer does away with recurrence and instead relies entirely on attention to capture global dependencies between input and output.

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

The goal of reducing sequential computation also motivated the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks to compute the hidden representations of all input and output positions in parallel. In these models, the number of operations needed to relate two arbitrary positions grows with the distance between them, linearly for ConvS2S and logarithmically for ByteNet, which makes it harder to learn dependencies between distant positions. The Transformer reduces this to a constant number of operations, at the cost of reduced effective resolution caused by averaging attention-weighted positions; multi-head attention counteracts this effect.

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations

Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence.

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks

End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and perform well on simple-language question answering and language modeling.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9]

The Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without sequence-aligned RNNs or convolutions.

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z = (z1, …, zn). Given z, the decoder then generates an output sequence (y1, …, ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.

Most competitive sequence transduction models use an encoder-decoder structure.

  • Encoder: maps the input sequence of symbol representations to a sequence of continuous representations.
  • Decoder: given those continuous representations, generates the output symbol sequence one element at a time.

At every step the model is auto-regressive: the previously generated symbols are consumed as additional input when generating the next one.
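
As a concrete illustration of this auto-regressive loop, here is a minimal greedy decoding sketch. It assumes a hypothetical `model(src, tgt, src_mask, tgt_mask)` interface that returns next-token logits (the full model is only assembled later in these notes), and hypothetical `bos_id`/`eos_id` special tokens.

import torch

@torch.no_grad()
def greedy_decode(model, src, src_mask, bos_id, eos_id, max_len=64):
    # Start the target sequence with the begin-of-sequence token.
    tgt = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_len):
        seq_len = tgt.size(1)
        # Causal mask so that position i only attends to positions <= i.
        tgt_mask = torch.tril(torch.ones(1, seq_len, seq_len, device=src.device))
        logits = model(src, tgt, src_mask, tgt_mask)          # [bsz, seq_len, vocab]
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tgt = torch.cat([tgt, next_token], dim=1)             # feed the prediction back in
        if (next_token == eos_id).all():
            break
    return tgt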

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

The Transformer follows this overall layout, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder.

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.

Encoder: a stack of N = 6 identical layers.

Each layer has two sub-layers: the first is a multi-head self-attention mechanism, the second is a simple position-wise fully connected feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization.

Each sub-layer therefore computes:

$$\mathrm{SubOutput}(x) = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$

To make these residual connections work, all sub-layers and the embedding layers produce outputs of dimension $d_{model} = 512$.

EncoderLayer and Encoder are implemented as follows:

import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


# EncoderLayer: multi-head self-attention followed by a position-wise feed-forward
# network, each with dropout, a residual connection and layer normalization.
class EncoderLayer(nn.Module):
    def __init__(
        self, d_model: int = 512, heads: int = 8, d_ff: int = 2048, dropout: float = 0.1
    ) -> None:
        super().__init__()

        self.attn = MultiHeadAttention(d_model, heads, dropout)
        self.dropout_1 = nn.Dropout(dropout)
        self.norm_1 = Norm(d_model)

        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.norm_2 = Norm(d_model)

    def forward(
        self, x: torch.Tensor, mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        x = x + self.dropout_1(self.attn(x, x, x, mask))
        x = self.norm_1(x)

        x = x + self.dropout_2(self.ffn(x))
        x = self.norm_2(x)
        return x

# Encoder
class Encoder(nn.Module):
    def __init__(
        self,
        N: int = 6,
        d_model: int = 512,
        heads: int = 8,
        d_ff: int = 2048,
        dropout: float = 0.1,
    ) -> None:
        super().__init__()

        self.N = N
        self.layers = nn.ModuleList(
            [EncoderLayer(d_model, heads, d_ff, dropout) for _ in range(N)]
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x, mask)
        return x
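
The encoder and decoder code relies on a `Norm` helper that is not shown elsewhere in this post (the `FeedForward` module appears in section 3.3). A minimal sketch of `Norm` as plain layer normalization over the feature dimension, consistent with how it is used above, could look like this; `nn.LayerNorm(d_model)` would work just as well:

class Norm(nn.Module):
    """Layer normalization over the last dimension (a stand-in for nn.LayerNorm)."""

    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(d_model))   # learnable scale
        self.bias = nn.Parameter(torch.zeros(d_model))   # learnable shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias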

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Decoder: also a stack of N = 6 identical layers.

In addition to the two sub-layers of each encoder layer, every decoder layer inserts a third sub-layer that performs multi-head attention over the encoder output. Each sub-layer again uses a residual connection followed by layer normalization. The decoder's self-attention sub-layer is also modified so that a position cannot attend to subsequent positions, ensuring that the prediction for position i depends only on the known outputs at positions less than i.

DecoderLayer and Decoder are implemented as follows:

# DecoderLayer
class DecoderLayer(nn.Module):
    def __init__(
        self, d_model: int = 512, heads: int = 8, d_ff: int = 2048, dropout: float = 0.1
    ) -> None:
        super().__init__()

        self.attn_1 = MultiHeadAttention(d_model, heads, dropout)
        self.dropout_1 = nn.Dropout(dropout)
        self.norm_1 = Norm(d_model)

        self.attn_2 = MultiHeadAttention(d_model, heads, dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.norm_2 = Norm(d_model)

        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.dropout_3 = nn.Dropout(dropout)
        self.norm_3 = Norm(d_model)

    def forward(
        self,
        x: torch.Tensor,
        enc_output: torch.Tensor,
        src_mask: torch.Tensor,
        tgt_mask: torch.Tensor,
    ) -> torch.Tensor:
        x = x + self.dropout_1(self.attn_1(x, x, x, tgt_mask))
        x = self.norm_1(x)

        x = x + self.dropout_2(self.attn_2(x, enc_output, enc_output, src_mask))
        x = self.norm_2(x)

        x = x + self.dropout_3(self.ffn(x))
        x = self.norm_3(x)
        return x

# Decoder
class Decoder(nn.Module):
    def __init__(
        self,
        N: int = 6,
        d_model: int = 512,
        heads: int = 8,
        d_ff: int = 2048,
        dropout: float = 0.1,
    ) -> None:
        super().__init__()

        self.N = N
        self.layers = nn.ModuleList(
            [DecoderLayer(d_model, heads, d_ff, dropout) for _ in range(N)]
        )

    def forward(
        self,
        x: torch.Tensor,
        enc_output: torch.Tensor,
        src_mask: torch.Tensor,
        tgt_mask: torch.Tensor,
    ) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is a weighted sum of the values, and the weight assigned to each value is computed from the query and the corresponding key.

3.2.1 Scaled Dot-Product Attention

We call our particular attention “Scaled Dot-Product Attention” (Figure 2). The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Queries and keys have dimension $d_k$; values have dimension $d_v$.

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

The two common attention functions are additive attention and dot-product attention. Their theoretical complexity is similar, but dot-product attention can be implemented with highly optimized matrix multiplication, so in practice it is faster and more space-efficient.

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.

For small $d_k$ the two mechanisms perform similarly, but without scaling, additive attention outperforms dot-product attention for larger $d_k$. The suspected reason is that for large $d_k$ the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients; scaling by $\frac{1}{\sqrt{d_k}}$ counteracts this.
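
A quick numerical check of this argument (my own illustration, not from the paper): for q and k with unit-variance entries, the dot product has variance roughly $d_k$, and dividing by $\sqrt{d_k}$ keeps the softmax away from its saturated, near-one-hot regime.

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)

scores = (q * k).sum(dim=-1)                  # raw dot products, variance ~ d_k
print(scores.std())                           # roughly sqrt(512) ~ 22.6
print((scores / d_k ** 0.5).std())            # roughly 1 after scaling

# Logits of that magnitude typically drive softmax to a near one-hot output,
# where the gradient with respect to most entries is vanishingly small.
logits = torch.randn(d_k) * d_k ** 0.5
print(torch.softmax(logits, dim=-1).max())               # usually very close to 1
print(torch.softmax(logits / d_k ** 0.5, dim=-1).max())  # much smaller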

3.2.2 Multi-Head Attention

Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Rather than running a single attention function over the $d_{model}$-dimensional keys, values and queries, it is more effective to linearly project the queries, keys and values $h$ times, with different learned projections, to $d_k$, $d_k$ and $d_v$ dimensions respectively. Attention is computed on each projected version in parallel, yielding $d_v$-dimensional outputs, which are concatenated and projected once more to obtain the final values.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \ldots, head_h)W^O$$

$$head_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

The projection matrices are:

$$W_i^Q \in \mathbb{R}^{d_{model} \times d_k},\quad W_i^K \in \mathbb{R}^{d_{model} \times d_k},\quad W_i^V \in \mathbb{R}^{d_{model} \times d_v},\quad W^O \in \mathbb{R}^{hd_v \times d_{model}}$$

In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{model}/h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

The paper uses 8 heads and, with $d_{model} = 512$ from earlier, $d_k = d_v = 512/8 = 64$. Because each head works in a reduced dimension, the total cost of multi-head attention is similar to that of single-head attention over the full dimension.

The cost comparison in more detail:

Let $seq\_len = n$ and $d_{model} = d$. Single-head attention over the full dimension consists of:

  • $QK^T$: a $(n, d) \times (d, n)$ matrix multiplication, costing $O(n^2 d)$
  • softmax: $O(n)$ per row, $O(n^2)$ in total
  • the weighted sum with $V$: a $(n, n) \times (n, d)$ matrix multiplication, costing $O(n^2 d)$

The total is $O(n^2 d)$.

Multi-head attention splits $d$ into $h$ groups of dimension $d/h$; each head costs:

  • $(n, d/h) \times (d/h, n)$ matrix multiplication: $O(n^2 d / h)$
  • softmax: $O(n)$ per row, $O(n^2)$ in total
  • the weighted sum with $V$: $(n, n) \times (n, d/h)$ matrix multiplication: $O(n^2 d / h)$

With $h$ heads, the total is again $O(n^2 d)$.

Multi-head attention is implemented as follows:

def attention(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    d_k: int,
    mask: Optional[torch.Tensor] = None,
    dropout: Optional[nn.Dropout] = None,
) -> torch.Tensor:
    # q, k, v: [bsz, heads, seq_len, d_k]
    # scores:  [bsz, heads, seq_len, seq_len]
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)

    if mask is not None:
        # broadcast [bsz, seq_len, seq_len] to [bsz, 1, seq_len, seq_len] across heads
        mask = mask.unsqueeze(1)
        scores = scores.masked_fill(mask == 0, -1e9)
    scores = F.softmax(scores, dim=-1)

    if dropout is not None:
        scores = dropout(scores)

    output = torch.matmul(scores, v)
    return output

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, heads: int = 8, dropout: float = 0.1) -> None:
        super().__init__()

        self.d_model = d_model
        self.heads = heads
        self.d_k = self.d_model // self.heads

        if self.heads * self.d_k != self.d_model:
            raise ValueError(
                f"d_model must be divisible by heads (got `d_model`: {self.d_model}"
                f" and `heads`: {self.heads})."
            )

        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)

        self.dropout = nn.Dropout(dropout)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        bsz = q.shape[0]

        # translate [bsz, seq_len, d_model] to [bsz, seq_len, heads, d_k]
        q = self.q_proj(q).view(bsz, -1, self.heads, self.d_k)
        k = self.k_proj(k).view(bsz, -1, self.heads, self.d_k)
        v = self.v_proj(v).view(bsz, -1, self.heads, self.d_k)

        # translate [bsz, seq_len, heads, d_k] to [bsz, heads, seq_len, d_k]
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # calculate attention
        scores = attention(q, k, v, self.d_k, mask, self.dropout)

        # cat multi-heads
        concat = scores.transpose(1, 2).contiguous().view(bsz, -1, self.d_model)
        output = self.o_proj(concat)
        return output

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

  • In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
  • The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
  • Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.

The Transformer uses multi-head attention in three different ways:

  • In the “encoder-decoder attention” layers, the queries come from the previous decoder layer while the keys and values come from the encoder output, so every position in the decoder can attend over all positions of the input sequence. This mimics the typical encoder-decoder attention of sequence-to-sequence models.
  • The encoder contains self-attention layers, in which all keys, values and queries come from the output of the previous encoder layer; each position can attend to all positions of that previous layer.
  • Similarly, the self-attention layers in the decoder allow each position to attend to all positions up to and including itself. Leftward information flow must be blocked to preserve the auto-regressive property; this is implemented inside scaled dot-product attention by masking out (setting to −∞) the softmax inputs that correspond to illegal connections (see the mask-construction sketch after this list).
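
The attention code above expects `src_mask`/`tgt_mask` tensors but never shows how they are built. A minimal sketch following the shape conventions of this post (the padding id is an assumption): the source mask hides padding tokens, and the target mask additionally zeroes out future positions.

import torch


def make_src_mask(src: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # [bsz, 1, src_len]: 1 where the token is real, 0 where it is padding
    return (src != pad_id).unsqueeze(1)


def make_tgt_mask(tgt: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # padding mask: [bsz, 1, tgt_len]
    pad_mask = (tgt != pad_id).unsqueeze(1)
    # subsequent mask: [1, tgt_len, tgt_len], lower-triangular so that position i
    # can only attend to positions <= i
    tgt_len = tgt.size(1)
    subsequent = torch.tril(torch.ones(1, tgt_len, tgt_len, device=tgt.device)).bool()
    # broadcasts to [bsz, tgt_len, tgt_len]; attention() later unsqueezes the head dim
    return pad_mask & subsequent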

3.3 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

Besides the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.

$$FFN(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{model} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.

The linear transformations are identical across positions but use different parameters from layer to layer; equivalently, they can be viewed as two convolutions with kernel size 1. The input and output dimension is 512 and the inner dimension is 2048.
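
This is the `FeedForward` module referenced by the encoder and decoder layers above. A minimal sketch consistent with the formula (placing dropout after the ReLU is my assumption), continuing the imports from the earlier code:

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position
        return self.linear_2(self.dropout(F.relu(self.linear_1(x))))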

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.

Learned embeddings convert the input and output tokens into $d_{model}$-dimensional vectors, and the decoder output passes through a learned linear transformation and a softmax to produce next-token probabilities. The two embedding layers and the pre-softmax linear transformation share the same weight matrix. In the embedding layers, the weights are multiplied by $\sqrt{d_{model}}$.

Note: with the Transformer's initialization, the embedding weights follow the distribution

$$\text{element of embedding vector} \sim N(0,\ 1/d_{model})$$

so the scale of the weights shrinks as $d_{model}$ grows; multiplying by $\sqrt{d_{model}}$ brings the distribution back to $N(0, 1)$.
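
A minimal sketch of a shared-weight embedding consistent with this note (the class name, the initialization and the weight-tying wiring are my assumptions, not code from this post). The $\sqrt{d_{model}}$ scaling itself is applied later, inside PositionalEncoder.forward below.

class Embedder(nn.Module):
    """Token embedding whose weight can be shared with the pre-softmax projection."""

    def __init__(self, vocab_size: int, d_model: int = 512) -> None:
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        # Assumed init: each element ~ N(0, 1/d_model), matching the note above.
        nn.init.normal_(self.embed.weight, mean=0.0, std=d_model ** -0.5)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # [bsz, seq_len] token ids -> [bsz, seq_len, d_model] vectors
        return self.embed(tokens)


# Weight sharing with a pre-softmax projection (hypothetical `generator`):
# generator = nn.Linear(d_model, vocab_size, bias=False)
# generator.weight = embedder.embed.weight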

3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9]. In this work, we use sine and cosine functions of different frequencies.

The model has no recurrence or convolution, so information about the relative or absolute positions of the tokens must be injected explicitly. The positional encodings have the same dimension $d_{model}$ as the embeddings so the two can be added. They can be fixed or learned; the Transformer uses sine and cosine functions of different frequencies.

That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

Each dimension of the positional encoding corresponds to a sinusoid, with wavelengths forming a geometric progression from 2π to 10000 · 2π. The choice rests on the hypothesis that it lets the model easily learn to attend by relative positions, since for any fixed offset k, $PE_{pos+k}$ is a linear function of $PE_{pos}$.

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

Learned positional embeddings gave nearly identical results; the sinusoidal version was chosen because it may let the model extrapolate to sequence lengths longer than those seen during training.

The positional encoding is implemented as follows:

class PositionalEncoder(nn.Module):
    def __init__(
        self, d_model: int = 512, max_seq_len: int = 2048, base: int = 10000
    ) -> None:
        super().__init__()
        self.d_model = d_model

        # 1 / base^(2i / d_model): one frequency per sin/cos pair, as in the paper
        inv_freq_half = 1.0 / (
            base ** (torch.arange(0, d_model, 2, dtype=torch.float) / d_model)
        )
        inv_freq = torch.arange(0, d_model, dtype=inv_freq_half.dtype)
        inv_freq[..., 0::2] = inv_freq_half
        inv_freq[..., 1::2] = inv_freq_half

        pos = torch.arange(max_seq_len, dtype=inv_freq.dtype)

        pe = torch.einsum("i, j -> ij", pos, inv_freq)
        pe[..., 0::2] = pe[..., 0::2].sin()
        pe[..., 1::2] = pe[..., 1::2].cos()

        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # scale the embeddings by sqrt(d_model) (see section 3.4) so they are not
        # swamped by the positional encodings
        x = x * math.sqrt(self.d_model)
        seq_len = x.shape[1]
        pe = self.pe[:seq_len].to(dtype=x.dtype)
        return x + pe
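
With all the pieces in place (Embedder is my sketch from section 3.4, the rest comes from this post), a full encoder-decoder wrapper might look like the following; the class and argument names are assumptions rather than the post's original code:

class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab: int,
        tgt_vocab: int,
        N: int = 6,
        d_model: int = 512,
        heads: int = 8,
        d_ff: int = 2048,
        dropout: float = 0.1,
    ) -> None:
        super().__init__()
        self.src_embed = Embedder(src_vocab, d_model)
        self.tgt_embed = Embedder(tgt_vocab, d_model)
        self.pe = PositionalEncoder(d_model)
        self.encoder = Encoder(N, d_model, heads, d_ff, dropout)
        self.decoder = Decoder(N, d_model, heads, d_ff, dropout)
        self.generator = nn.Linear(d_model, tgt_vocab, bias=False)
        # Tie the target embedding to the pre-softmax projection (section 3.4);
        # the paper shares all three matrices when source and target vocabularies coincide.
        self.generator.weight = self.tgt_embed.embed.weight

    def forward(self, src, tgt, src_mask, tgt_mask):
        enc_output = self.encoder(self.pe(self.src_embed(src)), src_mask)
        dec_output = self.decoder(self.pe(self.tgt_embed(tgt)), enc_output, src_mask, tgt_mask)
        return self.generator(dec_output)  # [bsz, tgt_len, tgt_vocab] logits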

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1, \ldots, x_n)$ to another sequence of equal length $(z_1, \ldots, z_n)$, with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.

Self-attention is compared against the recurrent and convolutional layers along three criteria.

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required

The first criterion is the total computational complexity per layer.

The second is the amount of computation that can be parallelized, measured by the minimum number of sequential operations required.

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks.

A key factor affecting this ability is the length of the paths that forward and backward signals have to traverse in the network: the shorter the path between any pair of input and output positions, the easier it is to learn long-range dependencies. The third criterion therefore compares the maximum path length between any two input and output positions across the different layer types.

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(n/r)$ |

The complexity of the Transformer's attention layer was derived earlier; here is the per-layer complexity of a recurrent network.

The recurrent update is:

$$h_t = f(x_t U + h_{t-1} W)$$

For a single time step, with $x_t$ and $h_{t-1}$ both of dimension $d$, the cost consists of:

  • $x_t U$: a $(1, d) \times (d, d)$ multiplication, costing $O(d^2)$
  • $h_{t-1} W$: a $(1, d) \times (d, d)$ multiplication, costing $O(d^2)$

Each time step therefore costs $O(d^2)$, and processing all $n$ time steps costs $O(n \cdot d^2)$, matching the table.

In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case with sentence representations used by state-of-the-art models in machine translation, such as word-piece [38] and byte-pair [31] representations.

Since self-attention costs $O(n^2 \cdot d)$ per layer and a recurrent layer costs $O(n \cdot d^2)$, self-attention is cheaper whenever $n < d$, which is usually the case for the word-piece or byte-pair sentence representations used in machine translation.

To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position. This would increase the maximum path length to $O(n/r)$. We plan to investigate this approach further in future work.

To improve performance on very long sequences, self-attention can be restricted so that each output position only attends to a neighborhood of size $r$ in the input sequence centered on that position. This raises the maximum path length from $O(1)$ to $O(n/r)$.

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions; doing so requires a stack of $O(n/k)$ layers (or $O(\log_k(n))$ with dilated convolutions).

The per-layer complexity of a convolutional layer can be derived as follows:

Assume one kernel of size $(k, d, 1)$. Computing a single output element costs $O(k \cdot d)$; sliding the kernel over an input of length $n$ (after padding) takes $n$ such computations, $O(n \cdot k \cdot d)$, and yields an $(n, 1)$ output. Producing a $d$-dimensional output requires $d$ kernels, so the total is $O(k \cdot n \cdot d^2)$.

As side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences

Self-attention can also yield more interpretable models: individual attention heads clearly learn to perform different tasks, and many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

5 Training

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).

Training was done on one machine with 8 NVIDIA P100 GPUs.

The base model takes about 0.4 s per step and was trained for 100,000 steps (12 hours).

The big model takes about 1.0 s per step and was trained for 300,000 steps (3.5 days).

5.3 Optimizer

The optimizer is Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$.

The learning rate is varied over the course of training according to:

$$lrate = d_{model}^{-0.5} \cdot \min\!\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right)$$

Or, written in a more intuitive form:

$$lrate = \frac{\min\!\left(\frac{1}{\sqrt{step\_num}},\ \frac{step\_num}{warmup\_steps^{1.5}}\right)}{\sqrt{d_{model}}}$$

This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps = 4000$.

For the first $warmup\_steps$ steps the learning rate increases linearly; afterwards it decays in proportion to $\frac{1}{\sqrt{step\_num}}$.
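
A small sketch of this schedule as a function that could be plugged into torch.optim.lr_scheduler.LambdaLR; the wiring below is my assumption, while the formula itself is the paper's:

def noam_lrate(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Example wiring, with base lr = 1.0 so the lambda alone sets the learning rate:
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lrate)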

5.4 Regularization

Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$.

Label Smoothing During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.

Dropout is applied to each sub-layer's output before the residual addition and normalization, and also to the sums of the embeddings and positional encodings, with $P_{drop} = 0.1$ for the base model. Label smoothing with $\epsilon_{ls} = 0.1$ is used during training; it hurts perplexity, since the model learns to be more uncertain, but improves accuracy and BLEU.
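
In current PyTorch, label smoothing can be expressed directly through the label_smoothing argument of the cross-entropy loss; a minimal sketch (treating token id 0 as padding is an assumption):

import torch.nn as nn

# logits: [bsz, tgt_len, vocab] from the model; targets: [bsz, tgt_len] token ids
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)

# loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))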

Summary

Reading the paper carefully from start to finish surfaced quite a few details I had not thought through before, including the complexity analysis.

