[Deep Learning] [NLP] BERT Theory and Code: Personal Online Sharing

Paper:
https://arxiv.org/abs/1810.04805

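Table of Contents

I. BERT Theory
- BERT Model Formulas
  - 1. Input Representation
  - 2. Self-Attention Mechanism
  - 3. Transformer Layer
II. Code for Understanding BERT
- 1. Self-Attention Mechanism
- 2. Transformer Layer
- 3. Positional Encoding
- 4. BERT Model
- Code Explanation
III. Multi-Class Text Classification with BERT
- Input Preprocessing
- Formulas:
- BERT Encoder
- Formulas:
- Classification Task
- Formulas:
- Loss Function
- Formulas:
- Summary
IV. What Is New in BERT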

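I. BERT Theory

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained natural language processing model developed by Google. It achieved strong results on many NLP tasks, mainly because it uses the context on both sides of every word in a sentence during training and prediction. The sections below explain it from two angles: formulas and code.

BERT Model Formulas

1. Input Representation

BERT's input is built from three embedding layers:

- Token Embeddings: token embeddings that represent each token in the sentence.
- Segment Embeddings: sentence embeddings that distinguish the two sentences of an input pair.
- Position Embeddings: position embeddings that represent the position of each token in the sequence.

The input vector is: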

$\text{Input} = \text{Token Embedding} + \text{Segment Embedding} + \text{Position Embedding}$
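To make the sum concrete, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). The three lookup tables, their values, and the 4-dimensional embedding size are hypothetical and chosen only for illustration; a real model learns these tables.

```python
# Toy embedding tables (hypothetical values, embedding dimension 4).
token_table = {
    101: [0.1, 0.2, 0.3, 0.4],
    7592: [0.5, 0.1, 0.0, 0.2],
    102: [0.3, 0.3, 0.1, 0.0],
}
segment_table = {0: [0.01, 0.01, 0.01, 0.01], 1: [0.02, 0.02, 0.02, 0.02]}
position_table = [
    [0.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.0],
    [0.2, 0.2, 0.0, 0.0],
]


def embed(input_ids, segment_ids):
    """Return one vector per token: token + segment + position embedding."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(input_ids, segment_ids)):
        summed = [
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[position])
        ]
        vectors.append(summed)
    return vectors


inputs = embed([101, 7592, 102], [0, 0, 0])
print(len(inputs), len(inputs[0]))  # 3 tokens, each a 4-dimensional vector
```

The three embeddings must share the same dimension, because they are added element by element rather than concatenated.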

2. Self-Attention Mechanism

The core of BERT is the Transformer's multi-head self-attention mechanism. Self-attention is computed as:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

where $Q, K, V$ are the query, key, and value matrices, and $d_k$ is the dimension of the keys.
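The formula can be traced by hand on tiny matrices. The following is a sketch in plain Python, assuming matrices are stored as lists of row vectors; the function names and the sample values are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One score per key: dot product scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return output


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print(out)  # each row is a weighted average of the rows of V
```

Because the first query matches the first key best, the first output row lies closer to the first row of V, and the second output row lies closer to the second.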

Multi-head attention concatenates the results of several attention heads:

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O$

Each head is computed as:

$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
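The split-and-concatenate structure can be sketched in plain Python. This is a deliberate simplification and not the full formula: it assumes that each projection $W_i^Q, W_i^K, W_i^V$ merely selects slice $i$ of the features and that $W^O$ is the identity, whereas a real model learns all of these matrices.

```python
import math


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


def multi_head(X, num_heads):
    """Self-attention over X where every head reads its own slice of the features.

    Assumption: the per-head projections only select slice i, and W^O is the identity.
    """
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    heads = []
    for i in range(num_heads):
        part = [row[i * d_k:(i + 1) * d_k] for row in X]
        heads.append(attention(part, part, part))
    # Concat(head_1, ..., head_h): join the per-head outputs token by token
    return [sum((head[t] for head in heads), []) for t in range(len(X))]


X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5]]
Y = multi_head(X, num_heads=2)
print(len(Y), len(Y[0]))  # same shape as X: 2 tokens, 4 features
```

Each head works in a space of size $d_k = d_{model} / h$, so the concatenated output has the same width as the input.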

3. Transformer Layer

Each Transformer layer contains multi-head self-attention and a feed-forward network, each followed by a residual connection and layer normalization:

$\text{Output} = \text{LayerNorm}(\text{MultiHead}(Q, K, V) + \text{Input})$

$\text{Output} = \text{LayerNorm}(\text{FFN}(\text{Output}) + \text{Output})$

In the second equation, the Output on the right-hand side is the result of the first equation.

The feed-forward network is defined as:

$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$
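The feed-forward sub-layer with its residual connection and layer normalization can be worked through on a single vector. The sketch below is plain Python under two assumptions: the weight matrices are stored as one weight list per output unit, and layer normalization omits its learnable gain and bias. All names and values are illustrative.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, one weight list per output unit."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, unit)) + b) for unit, b in zip(W1, b1)]
    return [sum(h * w for h, w in zip(hidden, unit)) + b for unit, b in zip(W2, b2)]


def ffn_sublayer(x, W1, b1, W2, b2):
    """LayerNorm(FFN(x) + x): feed-forward, residual connection, normalization."""
    y = ffn(x, W1, b1, W2, b2)
    return layer_norm([a + b for a, b in zip(y, x)])


x = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 hidden units, 2 inputs each
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2 outputs, 3 hidden units each
b2 = [0.0, 0.0]
out = ffn_sublayer(x, W1, b1, W2, b2)
print(out)
```

Here the third hidden unit receives -3 and is zeroed by the max(0, ·) step, so FFN(x) equals [1, 2]; adding the residual gives [2, 4], which layer normalization maps to roughly [-1, 1].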

II. Code for Understanding BERT

Below is a simplified, from-scratch BERT example based on PyTorch. The implementation covers the self-attention mechanism, multi-head attention, positional encoding, and the Transformer layer.

1. Self-Attention Mechanism

First, implement the self-attention mechanism:

import math

import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        assert self.head_dim * heads == embed_size, "Embedding size needs to be divisible by heads"
        # One full-width projection per role; splitting the result into heads
        # gives every head its own slice, matching W_i^Q, W_i^K, W_i^V.
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]
        # Project, then split the embedding into self.heads pieces
        values = self.values(values).reshape(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).reshape(N, key_len, self.heads, self.head_dim)
        queries = self.queries(query).reshape(N, query_len, self.heads, self.head_dim)
        # Dot product of queries and keys: (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            # mask must broadcast to (N, heads, query_len, key_len); 0 marks positions to ignore
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        # Scaled dot-product: divide by sqrt(d_k), where d_k is the per-head dimension
        attention = torch.softmax(energy / math.sqrt(self.head_dim), dim=3)
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(N, query_len, self.heads * self.head_dim)
        out = self.fc_out(out)
        return out

2. Transformer Layer

Next, implement a complete Transformer layer, including multi-head self-attention and the feed-forward network:

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super().__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        # FFN(x) = max(0, x W1 + b1) W2 + b2
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)
        # Residual connection + layer normalization around the attention sub-layer
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        # Residual connection + layer normalization around the feed-forward sub-layer
        out = self.dropout(self.norm2(forward + x))
        return out

3. Positional Encoding

Implement the positional encoding, which helps the model understand where each word sits in the sentence:

class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_length):
        super().__init__()
        # Fixed sinusoidal table as in the original Transformer; the original BERT
        # learns its position embeddings instead, so this is a simplification.
        encoding = torch.zeros(max_length, embed_size)
        pos = torch.arange(0, max_length).float().unsqueeze(1)
        _2i = torch.arange(0, embed_size, step=2).float()
        encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / embed_size)))
        encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / embed_size)))
        # A buffer is not trained but moves with the module across devices
        self.register_buffer("encoding", encoding)

    def forward(self, x):
        seq_len = x.size(1)
        return x + self.encoding[:seq_len, :]
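The sinusoidal table built by PositionalEncoding can be checked by hand for a single position. The sketch below recomputes one row in plain Python; the helper name positional_encoding is hypothetical and is not part of the class above.

```python
import math


def positional_encoding(pos, embed_size):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd ones."""
    vector = []
    for i in range(embed_size):
        two_i = i - (i % 2)  # indices 2i and 2i+1 share the same frequency
        angle = pos / (10000 ** (two_i / embed_size))
        vector.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vector


print(positional_encoding(0, 4))  # position 0 gives [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```

At position 0 every angle is zero, so the sine entries are 0 and the cosine entries are 1; later positions rotate each pair at a different frequency, which is what lets the model tell positions apart.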

4. BERT Model

Combine all the parts into the BERT model:

class BERT(nn.Module):
    def __init__(self,
                 vocab_size,
                 embed_size=768,
                 num_layers=12,
                 heads=12,
                 forward_expansion=4,
                 dropout=0.1,
                 max_length=512):
        super().__init__()
        # Token embeddings only; segment embeddings are omitted in this simplified version
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.position_encoding = PositionalEncoding(embed_size, max_length)
        self.layers = nn.ModuleList(
            [TransformerBlock(embed_size, heads, dropout, forward_expansion) for _ in range(num_layers)]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        out = self.embedding(x)
        out = self.position_encoding(out)
        out = self.dropout(out)
        for layer in self.layers:
            # Self-attention: value, key, and query are all the previous layer's output
            out = layer(out, out, out, mask)
        return out


# Usage example
# Define the parameters
vocab_size = 30522  # vocabulary size (for example, the BERT-base vocabulary)
embed_size = 768    # embedding dimension (for example, BERT-base)
num_layers = 12     # number of Transformer layers
heads = 12          # number of attention heads
max_length = 512    # maximum sequence length

# Create a BERT model instance
model = BERT(vocab_size, embed_size, num_layers, heads, max_length=max_length)

# Input tensor
input_ids = torch.randint(0, vocab_size, (1, max_length))  # example input

# Assume there is no mask; a padding mask would need a shape that broadcasts
# to (N, heads, seq_len, seq_len), for example (N, 1, 1, seq_len)
mask = None

# Forward pass
output = model(input_ids, mask)
print(output.shape)  # torch.Size([1, 512, 768])

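Code Explanation

Self-attention mechanism:
- The SelfAttention class implements multi-head self-attention.
- The forward method computes the attention weights and applies them to the values.
Transformer layer:
- The TransformerBlock class combines multi-head self-attention with the feed-forward network.
- The forward method runs self-attention and the feed-forward pass, applying layer normalization and residual connections.
Positional encoding:
- The PositionalEncoding class adds position information to the input.
- The forward method adds the positional encoding to the input embeddings.
BERT model:
- The BERT class combines the embedding layer, the positional encoding, and multiple Transformer layers.
- The forward method passes the input through the embedding, the positional encoding, dropout, and the Transformer layers in turn.

This implementation shows BERT's core mechanisms, but it is a simplified version meant for understanding how BERT works internally. In practice, using an existing library (such as transformers) is more efficient and reliable.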

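III. Multi-Class Text Classification with BERT

To implement a BERT-based multi-class text classification task, we use the Transformers library and PyTorch. The complete code below covers data loading, model training, evaluation, and plotting the loss and accuracy curves.

It uses BertForSequenceClassification.

First, install the required libraries:

pip install transformers torch scikit-learn matplotlib pandas

Here is the complete code: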

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split

# Load the data; the file is assumed to be data.csv with a text column 'seq' and a 'Label' column
df = pd.read_csv('data.csv')
df['Label'] = df['Label'].astype('category').cat.codes  # convert Label to integer class codes
NUM_LABELS = df['Label'].nunique()  # derive the class count from the data instead of hard-coding it


# Dataset class
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        # Tokenize, add [CLS]/[SEP], then pad or truncate to max_len
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }


# Prepare the data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
MAX_LEN = 128
BATCH_SIZE = 16
train_texts, val_texts, train_labels, val_labels = train_test_split(df['seq'], df['Label'], test_size=0.2, random_state=42)
train_dataset = TextDataset(train_texts.tolist(), train_labels.tolist(), tokenizer, MAX_LEN)
val_dataset = TextDataset(val_texts.tolist(), val_labels.tolist(), tokenizer, MAX_LEN)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

# Model definition
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=NUM_LABELS)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Training parameters
EPOCHS = 3
optimizer = optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()


# Training and evaluation functions
def train_epoch(model, data_loader, criterion, optimizer, device, scheduler, n_examples):
    model.train()
    losses = []
    correct_predictions = 0
    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        labels = d["label"].to(device)
        optimizer.zero_grad()
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        loss = criterion(outputs.logits, labels)
        # .item() keeps the counter a plain Python number so it can be plotted later
        correct_predictions += (torch.argmax(outputs.logits, dim=1) == labels).sum().item()
        losses.append(loss.item())
        loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()
    return correct_predictions / n_examples, np.mean(losses)


def eval_model(model, data_loader, criterion, device, n_examples):
    model.eval()
    losses = []
    correct_predictions = 0
    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            labels = d["label"].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            loss = criterion(outputs.logits, labels)
            correct_predictions += (torch.argmax(outputs.logits, dim=1) == labels).sum().item()
            losses.append(loss.item())
    return correct_predictions / n_examples, np.mean(losses)


# Train the model
history = {
    'train_acc': [],
    'train_loss': [],
    'val_acc': [],
    'val_loss': []
}
best_accuracy = 0
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)
    train_acc, train_loss = train_epoch(
        model,
        train_loader,
        criterion,
        optimizer,
        device,
        None,  # no learning-rate scheduler in this example
        len(train_dataset)
    )
    print(f'Train loss {train_loss} accuracy {train_acc}')
    val_acc, val_loss = eval_model(
        model,
        val_loader,
        criterion,
        device,
        len(val_dataset)
    )
    print(f'Val loss {val_loss} accuracy {val_acc}')
    print()
    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)
    # Keep the weights of the best epoch on the validation set
    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'best_model_state.bin')
        best_accuracy = val_acc

# Plot the loss and accuracy curves
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(history['train_loss'], label='train loss')
plt.plot(history['val_loss'], label='val loss')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Curve')
plt.subplot(1, 2, 2)
plt.plot(history['train_acc'], label='train accuracy')
plt.plot(history['val_acc'], label='val accuracy')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Accuracy Curve')
plt.show()

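The code performs the following steps:

1. Load and preprocess the data.
2. Define a PyTorch dataset class to simplify data loading.
3. Initialize the BERT tokenizer and model.
4. Split the dataset and create the data loaders.
5. Define the training and evaluation functions.
6. Train the model, record the loss and accuracy of every epoch, and save the best model.
7. Plot the training and validation loss and accuracy curves.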

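In a BERT text classification task, the data flow can be divided into several steps. The following describes this internal data flow in detail and expresses it with formulas.

Input Preprocessing

Suppose the input text is T. BERT's tokenizer converts it into IDs from the vocabulary.

Input text: T = "Hello, how are you?"
Token IDs after tokenization: input_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102] (101 and 102 are the special tokens [CLS] and [SEP])
Attention mask: attention_mask = [1, 1, 1, 1, 1, 1, 1, 1]

Formulas:

Input sequence: $X = [x_1, x_2, \ldots, x_n]$, where $x_i$ is the embedding vector of the $i$-th token.
Attention mask: $A = [a_1, a_2, \ldots, a_n]$, where $a_i$ is the attention mask value of the $i$-th token.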
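When sequences in a batch differ in length, the attention mask marks which positions are real tokens and which are padding. The following is a rough sketch in plain Python of how the two lists relate; the helper pad_batch and the sample IDs are illustrative, it assumes the padding ID is 0 (as in bert-base-uncased), and a real tokenizer would also keep [SEP] at the end when truncating.

```python
PAD_ID = 0  # assumption: the padding token has ID 0, as in bert-base-uncased


def pad_batch(sequences, max_len):
    """Pad or truncate each ID list to max_len and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_len]  # naive truncation, for illustration only
        padding = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]], max_len=5)
print(ids)   # the shorter sequence is filled up with PAD_ID
print(mask)  # 1 marks real tokens, 0 marks padding
```

The model uses the mask so that padded positions receive no attention weight and therefore do not influence the representations of real tokens.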

BERT Encoder

The BERT encoder is a stack of layers, each made of a self-attention sub-layer and a feed-forward sub-layer.

Output of each layer: $H_l = \text{TransformerLayer}(H_{l-1}, A)$, where $H_{l-1}$ is the output of the previous layer (the initial input is $X$) and $A$ is the attention mask.
Final hidden state: $H_L$, where $L$ is the number of BERT layers.

Formulas:

$H_0 = X$

$H_l = \text{TransformerLayer}(H_{l-1}, A) \quad \text{for} \quad l = 1, 2, \ldots, L$

$H_L = \text{BERT}(X, A)$

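Classification Task

The vector of the first token (the [CLS] token) in BERT's last hidden state $H_L$ is used for the classification task.

Extract the [CLS] token representation: $H_{CLS} = H_L[0]$
Classify with a fully connected layer: $\text{logits} = W \cdot H_{CLS} + b$, where $W$ is the weight matrix and $b$ is the bias vector.
Apply the softmax function to obtain class probabilities: $P(y|X) = \text{softmax}(\text{logits})$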

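Using the [CLS] token vector of the last hidden state as the representation for classification is a design choice of BERT, not a strategy adopted by every Transformer model. Other Transformer models may obtain the classification representation in different ways.

Some Transformer variants and other NLP models use strategies such as:

- Mean pooling: average the representation vectors of all tokens to get one sentence representation.
- Max pooling: take the element-wise maximum over the representation vectors of all tokens as the sentence representation.
- Self-attention pooling: weight the tokens dynamically by importance with an attention mechanism to get the sentence representation.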
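The first two strategies, together with [CLS] pooling, can be compared on a toy example. This is a sketch in plain Python with made-up hidden states; it assumes an attention mask is available so that padding positions are left out of the pooling.

```python
def cls_pool(hidden):
    """Use the vector of the first token ([CLS]) as the sentence representation."""
    return hidden[0]


def mean_pool(hidden, mask):
    """Average the vectors of all non-padding tokens."""
    kept = [h for h, m in zip(hidden, mask) if m == 1]
    return [sum(col) / len(kept) for col in zip(*kept)]


def max_pool(hidden, mask):
    """Element-wise maximum over the vectors of all non-padding tokens."""
    kept = [h for h, m in zip(hidden, mask) if m == 1]
    return [max(col) for col in zip(*kept)]


hidden = [[1.0, 4.0], [3.0, 2.0], [9.0, 9.0]]  # last row belongs to a padding token
mask = [1, 1, 0]
print(cls_pool(hidden))         # [1.0, 4.0]
print(mean_pool(hidden, mask))  # [2.0, 3.0]
print(max_pool(hidden, mask))   # [3.0, 4.0]
```

Without the mask, the padding row [9.0, 9.0] would dominate both the mean and the maximum, which is why pooling over all tokens usually needs the attention mask.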

For example, to switch BertForSequenceClassification to mean pooling, the output part of the model has to be changed so that it averages the representation vectors of all tokens. The example below shows one way to build such a model:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class BertForMeanPoolingSequenceClassification(nn.Module):
    def __init__(self, num_classes, bert_model_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(self.bert.config.hidden_dropout_prob)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
        # Mean pooling over real tokens only: padding positions are zeroed out
        # and excluded from the token count
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled_output = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits


# Usage example
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMeanPoolingSequenceClassification(num_classes=2, bert_model_name='bert-base-uncased')
input_text = ["Hello, how are you?", "Fine, thank you!"]
inputs = tokenizer(input_text, padding=True, truncation=True, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

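In this example, we define a new model, BertForMeanPoolingSequenceClassification, which uses mean pooling to obtain the sentence representation. In the forward method, we compute the average of the token representation vectors and then pass it through a fully connected layer for classification.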

Formulas:

$H_{CLS} = H_L[0]$

$\text{logits} = W \cdot H_{CLS} + b$

$P(y|X) = \text{softmax}(\text{logits})$
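The last two formulas can be followed numerically. The sketch below is plain Python with a hypothetical 2-dimensional [CLS] vector and a 3-class weight matrix; all values are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_cls, W, b):
    """logits = W . h_cls + b, followed by softmax; W holds one row per class."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) + bias for row, bias in zip(W, b)]
    return logits, softmax(logits)


h_cls = [1.0, -1.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
logits, probs = classify(h_cls, W, b)
print(logits)  # [1.0, -1.0, 0.0]
print(probs)   # probabilities that sum to 1; class 0 is the most likely
```

Softmax keeps the ordering of the logits, so the predicted class is simply the index of the largest logit.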

Loss Function

In classification tasks, the cross-entropy loss is typically used to measure the difference between the predicted probabilities and the true labels.

Cross-entropy loss: $\mathcal{L} = -\sum_{i=1}^{C} y_i \log P(y_i|X)$, where $C$ is the number of classes, $y_i$ is the true label of the $i$-th class (one-hot encoded), and $P(y_i|X)$ is the predicted probability of the $i$-th class.

Formulas:

$\mathcal{L} = -\sum_{i=1}^{C} y_i \log P(y_i|X)$
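A short worked example may help here. The sketch below evaluates the cross-entropy sum in plain Python for one sample; the probabilities are invented for illustration.

```python
import math


def cross_entropy(probs, one_hot):
    """Negative sum of y_i * log(p_i); with a one-hot y only the true class contributes."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


probs = [0.7, 0.2, 0.1]
loss_correct = cross_entropy(probs, [1, 0, 0])  # true class has probability 0.7
loss_wrong = cross_entropy(probs, [0, 0, 1])    # true class has probability 0.1
print(round(loss_correct, 4))  # 0.3567
print(round(loss_wrong, 4))    # 2.3026
```

The loss is small when the model assigns high probability to the true class and grows quickly as that probability approaches zero. Note that PyTorch's nn.CrossEntropyLoss expects raw logits and applies log-softmax internally, so the softmax step is not applied by hand before calling it.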

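Summary

The data flow of BERT in a text classification task can be summarized as follows:

1. The tokenizer encodes the input text into input_ids and attention_mask.
2. The BERT encoder computes the last hidden state $H_L$.
3. The [CLS] token representation is extracted from $H_L$ and passed through a fully connected layer to compute the class logits.
4. The softmax function is applied to obtain class probabilities.
5. The cross-entropy loss function is used to compute the loss.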

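IV. What Is New in BERT

BERT (Bidirectional Encoder Representations from Transformers) is based on the Transformer architecture, but its novelty goes beyond simply applying the Transformer to specific tasks. The key innovations of BERT relative to the original Transformer paper are:

- Bidirectional encoding: One of BERT's core innovations is its bidirectional Transformer encoder, which differs from traditional autoregressive language models (such as the Transformer decoder or OpenAI's GPT models). Traditional language models are usually unidirectional: when predicting a word, they see only the words before it (left-to-right) or only the words after it (right-to-left). Through the Masked Language Model (MLM) task, BERT considers the left and right context at the same time during training, so the model learns bidirectional dependencies between words.
- Pretraining and fine-tuning: BERT introduced large-scale unsupervised pretraining followed by fine-tuning on specific tasks. This greatly reduces the need to design task-specific architectures, because adding one extra output layer on top of the pretrained BERT model is enough to adapt it to downstream tasks such as question answering, sentiment analysis, and named entity recognition, and it significantly improved performance on these tasks.
- Multi-task learning: In addition to MLM, BERT uses Next Sentence Prediction (NSP) as part of pretraining, which aims to learn the relationship between text pairs and to strengthen the model's understanding of contextual coherence. Although later research suggested that NSP may not be a key factor in the performance gains, it reflects BERT's exploration of multi-task learning.
- Large-scale datasets: BERT was pretrained on a very large corpus (BooksCorpus and English Wikipedia), which helps the model learn broader language patterns and knowledge.
- Training details: According to the paper, BERT was trained with a batch size of 256 sequences and a learning rate that warms up over the first 10,000 steps and then decays linearly, along with other hyperparameter settings. These choices help the model learn high-quality representations more efficiently.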
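The masked language model objective mentioned in the list can be illustrated with a small sketch. As described in the BERT paper, about 15% of token positions are selected for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The function name, the toy vocabulary, and the word-level tokens below are hypothetical, and the example raises the masking probability to 0.5 only so that the short sentence shows some masked positions.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary (hypothetical)


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL or rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked, labels = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(labels)
```

Because a masked position can be predicted from the words on both sides of it, this objective lets the encoder use bidirectional context without letting a token trivially see itself.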

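Compared with the original Transformer paper, BERT's contribution lies in demonstrating the great potential of bidirectional Transformers for language understanding tasks, and in showing that pretraining followed by fine-tuning can greatly improve a model's generalization. This idea went on to shape the direction of the whole natural language processing field. BERT's innovations not only drove significant gains in model performance but also laid the foundation for later models such as XLNet, RoBERTa, and T5.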
