BERT Pre-trained Model Fine-Tuning Tutorial

©️ Copyright 2024 @ Authors
Author: 陈乐天 📨
Date: 2024-05-21
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Quick start: click the Connect button at the top, select the bohrium-notebook:2023-04-07 image and a c12_m46_1 * NVIDIA GPU B node configuration, then wait a moment for the notebook to become ready.


BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based language representation model. It is pre-trained in an unsupervised way and can then be fine-tuned to achieve strong performance on a wide range of natural language processing tasks. The following is a step-by-step tutorial on fine-tuning a BERT model.

Preparation

First, we need to install the necessary Python libraries, including transformers and datasets (the data is loaded manually in this notebook, so only transformers is strictly required). They can be installed with the following commands:

[1]
# Set up the network proxy
import os
os.environ['HTTP_PROXY'] = 'http://ga.dp.tech:8118'
os.environ['HTTPS_PROXY'] = 'http://ga.dp.tech:8118'
[2]
# Upgrade the libraries
! pip install --upgrade transformers huggingface_hub
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: transformers in /opt/conda/lib/python3.8/site-packages (4.27.1)
Collecting transformers
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/07/78/c23e1c70b89f361d855a5d0a19b229297f6456961f9a1afa9a69cd5a70c3/transformers-4.41.0-py3-none-any.whl (9.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.1/9.1 MB 43.6 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: huggingface_hub in /opt/conda/lib/python3.8/site-packages (0.13.2)
Collecting huggingface_hub
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/92/27/1a30d8082ef3c8615ae198b9d451fafffdab815b96727ec3c06befc27ebe/huggingface_hub-0.23.1-py3-none-any.whl (401 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 401.3/401.3 kB 66.6 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.8/site-packages (from transformers) (1.22.4)
Requirement already satisfied: filelock in /opt/conda/lib/python3.8/site-packages (from transformers) (3.9.0)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers) (6.0)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers) (4.64.1)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers) (2022.6.2)
Collecting safetensors>=0.4.1
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/41/ae/7b9e79467ab81884b457214eace4b20214e286277b75c47150ff297c8561/safetensors-0.4.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 79.9 MB/s eta 0:00:00
Collecting tokenizers<0.20,>=0.19
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/18/0d/ee99f50407788149bc9eddae6af0b4016865d67fb687730d151683b13b80/tokenizers-0.19.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 44.7 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers) (23.0)
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers) (2.28.2)
Collecting fsspec>=2023.5.0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ba/a3/16e9fe32187e9c8bc7f9b7bcd9728529faa725231a0c96f2f98714ff2fc5/fsspec-2024.5.0-py3-none-any.whl (316 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.1/316.1 kB 63.7 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.8/site-packages (from huggingface_hub) (4.5.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (1.26.14)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (2022.12.7)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (3.0.1)
Installing collected packages: safetensors, fsspec, huggingface_hub, tokenizers, transformers
  Attempting uninstall: safetensors
    Found existing installation: safetensors 0.3.0
    Uninstalling safetensors-0.3.0:
      Successfully uninstalled safetensors-0.3.0
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2023.1.0
    Uninstalling fsspec-2023.1.0:
      Successfully uninstalled fsspec-2023.1.0
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.13.2
    Uninstalling huggingface-hub-0.13.2:
      Successfully uninstalled huggingface-hub-0.13.2
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.13.2
    Uninstalling tokenizers-0.13.2:
      Successfully uninstalled tokenizers-0.13.2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.27.1
    Uninstalling transformers-4.27.1:
      Successfully uninstalled transformers-4.27.1
Successfully installed fsspec-2024.5.0 huggingface_hub-0.23.1 safetensors-0.4.3 tokenizers-0.19.1 transformers-4.41.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Data Preparation

We will use the IMDb movie review dataset as an example; it contains positive and negative movie reviews.

  • Large Movie Review Dataset

The dataset contains 50,000 movie reviews with binary sentiment polarity labels, split into a training set and a test set of 25,000 reviews each, with balanced labels (12,500 positive and 12,500 negative reviews per split). It also includes 50,000 unlabeled documents for unsupervised learning. No movie appears in both the training and test sets, so a model cannot improve its score simply by memorizing movie-specific terms. In the labeled sets, reviews with a score <= 4 are negative and reviews with a score >= 7 are positive. The data is stored in separate directories by label, along with the corresponding IMDb URLs and bag-of-words feature files.

Dataset page: https://ai.stanford.edu/~amaas/data/sentiment/

File structure overview

[1]
# Copy and extract the dataset archive
!cp /bohr/BERT-data-tar-k630/v1/aclImdb_v1.tar.gz .
!tar -xvzf /bohr/BERT-data-tar-k630/v1/aclImdb_v1.tar.gz
(output hidden)
[2]
!tree -L 2 /aclImdb
/aclImdb
├── README
├── imdb.vocab
├── imdbEr.txt
├── test
│   ├── labeledBow.feat
│   ├── neg
│   ├── pos
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg
    ├── pos
    ├── unsup
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt

7 directories, 11 files

Top-level directories

  • [train/]: training set
  • [test/]: test set

Subdirectories

  • [pos/]: positively labeled reviews
  • [neg/]: negatively labeled reviews
  • [unsup/]: unlabeled data (training set only)

IMDb URL files

  • urls_[pos, neg, unsup].txt:
    • contain the IMDb URL for each review
    • for example, the review with identifier 200 has its URL on line 200 of the file
    • the URLs point to the movie's review page, not directly to the individual review

Bag-of-words (BoW) features

  • Location: .feat files in the train/ and test/ directories
  • Format: LIBSVM format (an ASCII sparse-vector format); a small parsing sketch follows this list
    • Example: 0:7 means that the first word in the imdb.vocab file (the) appears 7 times in the review
  • Vocabulary file: imdb.vocab
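
To make the .feat format concrete, here is a small parsing sketch that is not part of the original notebook. It assumes the archive has been extracted to /aclImdb as in the cell above, and that the leading field of each line is the review's star rating, as described in the dataset README.

# Hypothetical sketch: decode the first line of the bag-of-words file.
with open('/aclImdb/imdb.vocab', encoding='utf-8') as f:
    vocab = [line.strip() for line in f]       # one word per line

with open('/aclImdb/train/labeledBow.feat', encoding='utf-8') as f:
    fields = f.readline().split()

rating = int(fields[0])                        # star rating of this review
counts = {vocab[int(i)]: int(c)                # "index:count" -> word counts
          for i, c in (pair.split(':') for pair in fields[1:])}

print(rating, list(counts.items())[:5])        # e.g. ('the', 7), ...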

Loading the Data

[3]
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
import torch
from torch.utils.data import Dataset, DataLoader

# Function that loads the raw IMDb reviews into a pandas DataFrame
def load_data(data_dir, split='train'):
    data = {'text': [], 'label': []}
    for label in ['pos', 'neg']:
        label_dir = os.path.join(data_dir, split, label)
        for filename in os.listdir(label_dir):
            if filename.endswith('.txt'):
                with open(os.path.join(label_dir, filename), 'r', encoding='utf-8') as file:
                    data['text'].append(file.read())
                    data['label'].append(1 if label == 'pos' else 0)
    return pd.DataFrame(data)

# Load the training and test sets
train_data = load_data('/aclImdb', 'train')
test_data = load_data('/aclImdb', 'test')

# Split off a validation set (20%) from the training data
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)

Data Preprocessing

BERT expects a specific input format, including input IDs and attention masks. We use the BertTokenizer provided by transformers to preprocess the data.

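As a quick illustration (not part of the original notebook) of what this input format looks like, the tokenizer can be applied to a single sentence; the sample sentence and max_length=16 are arbitrary choices made here for display purposes.

# Small illustration: the dictionary of tensors BERT expects as input.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
sample = tok("This movie was great!", padding="max_length", truncation=True,
             max_length=16, return_tensors="pt")

print(sample.keys())                # input_ids, token_type_ids, attention_mask
print(sample["input_ids"].shape)    # torch.Size([1, 16])
print(sample["attention_mask"][0])  # 1 for real tokens, 0 for padding
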
[5]
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
import torch.optim as optim
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Preprocessing function: tokenize, pad and truncate the review texts
def preprocess_function(examples):
    return tokenizer(examples['text'].tolist(), padding="max_length", truncation=True, max_length=128, return_tensors='pt')

# Apply the preprocessing to the train/validation/test splits
train_encodings = preprocess_function(train_data)
val_encodings = preprocess_function(val_data)
test_encodings = preprocess_function(test_data)

Creating a Dataset Class

To stay compatible with the PyTorch DataLoader, we define a custom dataset class.

[6]
# IMDb dataset wrapper so the encodings work with a PyTorch DataLoader
class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create the PyTorch datasets
train_dataset = IMDbDataset(train_encodings, train_data['label'].tolist())
val_dataset = IMDbDataset(val_encodings, val_data['label'].tolist())
test_dataset = IMDbDataset(test_encodings, test_data['label'].tolist())



Creating Data Loaders

We create PyTorch data loaders so that data can be read in batches during training.

[7]
from torch.utils.data import DataLoader

# Create the data loaders
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=8, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False)
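
As an optional sanity check (not in the original notebook), one batch can be pulled from the training loader to confirm that the tensor shapes match the batch_size=8 and max_length=128 chosen above.

# Optional sanity check: inspect the shapes of one training batch.
batch = next(iter(train_dataloader))
print({key: tensor.shape for key, tensor in batch.items()})
# Expected: input_ids / token_type_ids / attention_mask -> [8, 128], labels -> [8]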

Model Definition and Training

We use the BertForSequenceClassification model provided by transformers and train it with the AdamW optimizer; the cross-entropy loss is computed internally by the model when labels are passed in.

[8]
# ---------------- Time Warning: ~10 mins -----------------
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
import torch.optim as optim
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Load the pre-trained BERT model with a 2-class classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define the optimizer and the linear learning-rate scheduler
optimizer = optim.AdamW(model.parameters(), lr=5e-5)
total_steps = len(train_dataloader) * 3  # 3 epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Train the model
model.train()
for epoch in range(3):  # train for 3 epochs
    epoch_loss = 0  # running total of the loss for this epoch
    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch+1}", leave=False)
    for batch in progress_bar:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()

        epoch_loss += loss.item()
        progress_bar.set_postfix(loss=loss.item())
    avg_epoch_loss = epoch_loss / len(train_dataloader)
    print(f"Epoch {epoch+1} completed. Average Loss: {avg_epoch_loss:.4f}")
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1 completed. Average Loss: 0.3738                                   
Epoch 2 completed. Average Loss: 0.1900                                   
Epoch 3 completed. Average Loss: 0.0483

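The notebook itself does not save the fine-tuned weights. If you want to reuse them later, a minimal sketch using the standard Hugging Face save_pretrained API is shown below; the output directory name is an arbitrary choice for this sketch.

# Optional: persist the fine-tuned model and tokenizer for later reuse.
output_dir = "bert-imdb-finetuned"   # arbitrary directory name
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
# They can later be reloaded with BertForSequenceClassification.from_pretrained(output_dir).
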

Model Evaluation

After training, we evaluate the model's performance on the validation and test sets.

[9]
# Evaluation function: returns the accuracy on a given dataloader
def evaluate(dataloader):
    model.eval()
    predictions, true_labels = [], []

    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions.extend(torch.argmax(logits, dim=-1).tolist())
        true_labels.extend(labels.tolist())

    return accuracy_score(true_labels, predictions)

# Evaluate the model on the validation and test sets
val_accuracy = evaluate(val_dataloader)
test_accuracy = evaluate(test_dataloader)

print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
Validation Accuracy: 0.8908
Test Accuracy: 0.8815

Summary

The code above shows how to load and preprocess the IMDb dataset and then fine-tune a BERT model on it. By loading the data, preprocessing it, defining a dataset class, creating data loaders, and training and evaluating the model, we have completed an end-to-end BERT fine-tuning workflow.
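
As a possible next step not covered above, here is a minimal inference sketch that classifies a new review with the fine-tuned model. It assumes model, tokenizer and device are still defined from the cells above, and the two sample reviews are made up for illustration.

# Minimal inference sketch: classify a single new review.
def predict_sentiment(text):
    model.eval()
    inputs = tokenizer(text, padding="max_length", truncation=True,
                       max_length=128, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return "positive" if logits.argmax(dim=-1).item() == 1 else "negative"

print(predict_sentiment("A wonderful film with a moving story and great acting."))
print(predict_sentiment("Two hours of my life I will never get back."))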
