BERT Pre-trained Model Fine-Tuning Tutorial

©️ Copyright 2024 @ Authors
Author: 陈乐天 📨
Date: 2024-05-21
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the Connect button at the top, select the bohrium-notebook:2023-04-07 image and a c12_m46_1 * NVIDIA GPU B node configuration, then wait a moment for the notebook to become ready.


BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based language representation model. It is pre-trained in an unsupervised way and can then be fine-tuned to achieve strong performance on a wide range of natural language processing tasks. The following is a step-by-step tutorial on fine-tuning a BERT model.

Preparation

First, we need to install the necessary Python libraries, including transformers and datasets (the data is loaded manually in this notebook, so only transformers is strictly required). They can be installed with the following commands:

[1]
# Set up the network proxy
import os
os.environ['HTTP_PROXY'] = 'http://ga.dp.tech:8118'
os.environ['HTTPS_PROXY'] = 'http://ga.dp.tech:8118'
[2]
# Upgrade the libraries
! pip install --upgrade transformers huggingface_hub
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: transformers in /opt/conda/lib/python3.8/site-packages (4.27.1)
Collecting transformers
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/07/78/c23e1c70b89f361d855a5d0a19b229297f6456961f9a1afa9a69cd5a70c3/transformers-4.41.0-py3-none-any.whl (9.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.1/9.1 MB 43.6 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: huggingface_hub in /opt/conda/lib/python3.8/site-packages (0.13.2)
Collecting huggingface_hub
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/92/27/1a30d8082ef3c8615ae198b9d451fafffdab815b96727ec3c06befc27ebe/huggingface_hub-0.23.1-py3-none-any.whl (401 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 401.3/401.3 kB 66.6 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.8/site-packages (from transformers) (1.22.4)
Requirement already satisfied: filelock in /opt/conda/lib/python3.8/site-packages (from transformers) (3.9.0)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers) (6.0)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers) (4.64.1)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers) (2022.6.2)
Collecting safetensors>=0.4.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/41/ae/7b9e79467ab81884b457214eace4b20214e286277b75c47150ff297c8561/safetensors-0.4.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 79.9 MB/s eta 0:00:00
Collecting tokenizers<0.20,>=0.19
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/18/0d/ee99f50407788149bc9eddae6af0b4016865d67fb687730d151683b13b80/tokenizers-0.19.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 44.7 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers) (23.0)
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers) (2.28.2)
Collecting fsspec>=2023.5.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ba/a3/16e9fe32187e9c8bc7f9b7bcd9728529faa725231a0c96f2f98714ff2fc5/fsspec-2024.5.0-py3-none-any.whl (316 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.1/316.1 kB 63.7 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.8/site-packages (from huggingface_hub) (4.5.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (1.26.14)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (2022.12.7)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (3.0.1)
Installing collected packages: safetensors, fsspec, huggingface_hub, tokenizers, transformers
  Attempting uninstall: safetensors
    Found existing installation: safetensors 0.3.0
    Uninstalling safetensors-0.3.0:
      Successfully uninstalled safetensors-0.3.0
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2023.1.0
    Uninstalling fsspec-2023.1.0:
      Successfully uninstalled fsspec-2023.1.0
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.13.2
    Uninstalling huggingface-hub-0.13.2:
      Successfully uninstalled huggingface-hub-0.13.2
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.13.2
    Uninstalling tokenizers-0.13.2:
      Successfully uninstalled tokenizers-0.13.2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.27.1
    Uninstalling transformers-4.27.1:
      Successfully uninstalled transformers-4.27.1
Successfully installed fsspec-2024.5.0 huggingface_hub-0.23.1 safetensors-0.4.3 tokenizers-0.19.1 transformers-4.41.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Data Preparation

We will use the IMDb movie review dataset as an example; it contains positive and negative movie reviews.

  • Large Movie Review Dataset

The dataset contains 50,000 movie reviews with binary sentiment polarity labels, split into a training set and a test set of 25,000 reviews each, with balanced labels (12,500 positive and 12,500 negative reviews per split). It also includes 50,000 unlabeled documents for unsupervised learning. No movie appears in both the training and test sets, so a model cannot improve its score simply by memorizing movie-specific terms. In the labeled sets, reviews with a score <= 4 are negative and reviews with a score >= 7 are positive. The data is stored in separate directories by label, along with the corresponding IMDb URLs and bag-of-words feature files.

Dataset page: https://ai.stanford.edu/~amaas/data/sentiment/

File structure overview

[1]
# Copy and extract the dataset archive
!cp /bohr/BERT-data-tar-k630/v1/aclImdb_v1.tar.gz .
!tar -xvzf /bohr/BERT-data-tar-k630/v1/aclImdb_v1.tar.gz
(output hidden)
[2]
!tree -L 2 /aclImdb
/aclImdb
├── README
├── imdb.vocab
├── imdbEr.txt
├── test
│   ├── labeledBow.feat
│   ├── neg
│   ├── pos
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg
    ├── pos
    ├── unsup
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt

7 directories, 11 files

Top-level directories

  • [train/]: training set
  • [test/]: test set

Subdirectories

  • [pos/]: positively labeled reviews
  • [neg/]: negatively labeled reviews
  • [unsup/]: unlabeled data (training set only)

IMDb URL files

  • urls_[pos, neg, unsup].txt:
    • contain the IMDb URL for each review
    • for example, the review with identifier 200 has its URL on line 200 of the file
    • the URLs point to the movie's review page, not directly to the individual review

Bag-of-words (BoW) features

  • Location: .feat files in the train/ and test/ directories
  • Format: LIBSVM format (an ASCII sparse-vector format); a small parsing sketch follows this list
    • Example: 0:7 means that the first word in the imdb.vocab file (the) appears 7 times in the review
  • Vocabulary file: imdb.vocab
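
To make the .feat format concrete, here is a small parsing sketch that is not part of the original notebook. It assumes the archive has been extracted to /aclImdb as in the cell above, and that the leading field of each line is the review's star rating, as described in the dataset README.

# Hypothetical sketch: decode the first line of the bag-of-words file.
with open('/aclImdb/imdb.vocab', encoding='utf-8') as f:
    vocab = [line.strip() for line in f]       # one word per line

with open('/aclImdb/train/labeledBow.feat', encoding='utf-8') as f:
    fields = f.readline().split()

rating = int(fields[0])                        # star rating of this review
counts = {vocab[int(i)]: int(c)                # "index:count" -> word counts
          for i, c in (pair.split(':') for pair in fields[1:])}

print(rating, list(counts.items())[:5])        # e.g. ('the', 7), ...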

Loading the Data

[3]
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
import torch
from torch.utils.data import Dataset, DataLoader

# Function that loads the raw IMDb reviews into a pandas DataFrame
def load_data(data_dir, split='train'):
    data = {'text': [], 'label': []}
    for label in ['pos', 'neg']:
        label_dir = os.path.join(data_dir, split, label)
        for filename in os.listdir(label_dir):
            if filename.endswith('.txt'):
                with open(os.path.join(label_dir, filename), 'r', encoding='utf-8') as file:
                    data['text'].append(file.read())
                    data['label'].append(1 if label == 'pos' else 0)
    return pd.DataFrame(data)

# Load the training and test sets
train_data = load_data('/aclImdb', 'train')
test_data = load_data('/aclImdb', 'test')

# Split off a validation set (20%) from the training data
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)

Data Preprocessing

BERT expects a specific input format, including input IDs and attention masks. We use the BertTokenizer provided by transformers to preprocess the data.

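As a quick illustration (not part of the original notebook) of what this input format looks like, the tokenizer can be applied to a single sentence; the sample sentence and max_length=16 are arbitrary choices made here for display purposes.

# Small illustration: the dictionary of tensors BERT expects as input.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
sample = tok("This movie was great!", padding="max_length", truncation=True,
             max_length=16, return_tensors="pt")

print(sample.keys())                # input_ids, token_type_ids, attention_mask
print(sample["input_ids"].shape)    # torch.Size([1, 16])
print(sample["attention_mask"][0])  # 1 for real tokens, 0 for padding
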
[5]
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
import torch.optim as optim
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Preprocessing function: tokenize, pad and truncate the review texts
def preprocess_function(examples):
    return tokenizer(examples['text'].tolist(), padding="max_length", truncation=True, max_length=128, return_tensors='pt')

# Apply the preprocessing to the train/validation/test splits
train_encodings = preprocess_function(train_data)
val_encodings = preprocess_function(val_data)
test_encodings = preprocess_function(test_data)

Creating a Dataset Class

To stay compatible with the PyTorch DataLoader, we define a custom dataset class.

[6]
# IMDb dataset wrapper so the encodings work with a PyTorch DataLoader
class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create the PyTorch datasets
train_dataset = IMDbDataset(train_encodings, train_data['label'].tolist())
val_dataset = IMDbDataset(val_encodings, val_data['label'].tolist())
test_dataset = IMDbDataset(test_encodings, test_data['label'].tolist())



Creating Data Loaders

We create PyTorch data loaders so that data can be read in batches during training.

[7]
from torch.utils.data import DataLoader

# Create the data loaders
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=8, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False)
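
As an optional sanity check (not in the original notebook), one batch can be pulled from the training loader to confirm that the tensor shapes match the batch_size=8 and max_length=128 chosen above.

# Optional sanity check: inspect the shapes of one training batch.
batch = next(iter(train_dataloader))
print({key: tensor.shape for key, tensor in batch.items()})
# Expected: input_ids / token_type_ids / attention_mask -> [8, 128], labels -> [8]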

Model Definition and Training

We use the BertForSequenceClassification model provided by transformers and train it with the AdamW optimizer; the cross-entropy loss is computed internally by the model when labels are passed in.

[8]
# ---------------- Time Warning: ~10 mins -----------------
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
import torch.optim as optim
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Load the pre-trained BERT model with a 2-class classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define the optimizer and the linear learning-rate scheduler
optimizer = optim.AdamW(model.parameters(), lr=5e-5)
total_steps = len(train_dataloader) * 3  # 3 epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Train the model
model.train()
for epoch in range(3):  # train for 3 epochs
    epoch_loss = 0  # running total of the loss for this epoch
    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch+1}", leave=False)
    for batch in progress_bar:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()

        epoch_loss += loss.item()
        progress_bar.set_postfix(loss=loss.item())
    avg_epoch_loss = epoch_loss / len(train_dataloader)
    print(f"Epoch {epoch+1} completed. Average Loss: {avg_epoch_loss:.4f}")
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1 completed. Average Loss: 0.3738                                   
Epoch 2 completed. Average Loss: 0.1900                                   
Epoch 3 completed. Average Loss: 0.0483

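The notebook itself does not save the fine-tuned weights. If you want to reuse them later, a minimal sketch using the standard Hugging Face save_pretrained API is shown below; the output directory name is an arbitrary choice for this sketch.

# Optional: persist the fine-tuned model and tokenizer for later reuse.
output_dir = "bert-imdb-finetuned"   # arbitrary directory name
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
# They can later be reloaded with BertForSequenceClassification.from_pretrained(output_dir).
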

Model Evaluation

After training, we evaluate the model's performance on the validation and test sets.

[9]
# Evaluation function: returns the accuracy on a given dataloader
def evaluate(dataloader):
    model.eval()
    predictions, true_labels = [], []

    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions.extend(torch.argmax(logits, dim=-1).tolist())
        true_labels.extend(labels.tolist())

    return accuracy_score(true_labels, predictions)

# Evaluate the model on the validation and test sets
val_accuracy = evaluate(val_dataloader)
test_accuracy = evaluate(test_dataloader)

print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
Validation Accuracy: 0.8908
Test Accuracy: 0.8815

Summary

The code above shows how to load and preprocess the IMDb dataset and then fine-tune a BERT model on it. By loading the data, preprocessing it, defining a dataset class, creating data loaders, and training and evaluating the model, we have completed an end-to-end BERT fine-tuning workflow.
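
As a possible next step not covered above, here is a minimal inference sketch that classifies a new review with the fine-tuned model. It assumes model, tokenizer and device are still defined from the cells above, and the two sample reviews are made up for illustration.

# Minimal inference sketch: classify a single new review.
def predict_sentiment(text):
    model.eval()
    inputs = tokenizer(text, padding="max_length", truncation=True,
                       max_length=128, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return "positive" if logits.argmax(dim=-1).item() == 1 else "negative"

print(predict_sentiment("A wonderful film with a moving story and great acting."))
print(predict_sentiment("Two hours of my life I will never get back."))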
