



Fine-tuning a Pre-trained BERT Model: A Tutorial
©️ Copyright 2024 @ Authors
Author: 陈乐天
Date: 2024-05-21
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
Quick start: click the Connect button at the top of the page, select the bohrium-notebook:2023-04-07 image and the c12_m46_1 * NVIDIA GPU B node configuration, then wait a moment for the environment to start.
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based language representation model. It is pre-trained in an unsupervised fashion and can then be fine-tuned to achieve strong performance on a wide range of natural language processing tasks. This tutorial walks through fine-tuning a BERT model step by step.
Setup
First, we need to install the required Python libraries, transformers and datasets. They can be installed with the following command:
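A minimal install cell, assuming the Tsinghua PyPI mirror that appears in the log below (the exact command and the -U upgrade flag are assumptions):

```python
# Assumed install cell: upgrade transformers and install datasets
# via the Tsinghua PyPI mirror seen in the output below
!pip install -U transformers datasets -i https://pypi.tuna.tsinghua.edu.cn/simple
```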
```
Successfully installed fsspec-2024.5.0 huggingface_hub-0.23.1 safetensors-0.4.3 tokenizers-0.19.1 transformers-4.41.0
```
Data Preparation
We will use the IMDb movie review dataset as our example; it contains positive and negative movie reviews.
- Large Movie Review Dataset
The dataset contains 50,000 movie reviews with binary sentiment polarity labels, split into a 25,000-review training set and a 25,000-review test set, each with balanced labels (12,500 positive and 12,500 negative reviews). It also includes 50,000 unlabeled documents for unsupervised learning. No movie appears in both the training and test sets, so a model cannot improve its score by memorizing movie-specific terms. In the labeled data, reviews with a score <= 4 are labeled negative and reviews with a score >= 7 are labeled positive. Reviews are stored in separate directories by label, and IMDb URLs and bag-of-words feature files are provided.
Dataset link: https://ai.stanford.edu/~amaas/data/sentiment/
File structure overview
```
/aclImdb
├── README
├── imdb.vocab
├── imdbEr.txt
├── test
│   ├── labeledBow.feat
│   ├── neg
│   ├── pos
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg
    ├── pos
    ├── unsup
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt

7 directories, 11 files
```
Top-level directories
- train/: the training set
- test/: the test set
Subdirectories
- pos/: reviews with positive labels
- neg/: reviews with negative labels
- unsup/: unlabeled data for unsupervised learning (training set only)
IMDb URL files
- urls_[pos, neg, unsup].txt:
- Contains the IMDb URL for each review
- For example, the review with ID 200 has its URL on line 200 of the file
- The URLs link only to a movie's review page; they cannot point to an individual review
Bag-of-Words (BoW) features
- Location: .feat files in the train/ and test/ directories
- Format: LIBSVM format (ASCII sparse-vector format)
- Example: 0:7 means that the first word in imdb.vocab ("the") appears 7 times in that review
- Vocabulary file: imdb.vocab
Loading the Data
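A minimal sketch of reading the aclImdb directory layout described above into Python lists; the aclImdb/ path and the read_imdb_split helper name are illustrative assumptions:

```python
import os

def read_imdb_split(split_dir):
    """Read all reviews under <split_dir>/pos and <split_dir>/neg.

    Returns parallel lists of review texts and labels (1 = positive, 0 = negative).
    """
    texts, labels = [], []
    for label_name, label_id in (("pos", 1), ("neg", 0)):
        label_dir = os.path.join(split_dir, label_name)
        for fname in sorted(os.listdir(label_dir)):
            if not fname.endswith(".txt"):
                continue
            with open(os.path.join(label_dir, fname), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label_id)
    return texts, labels

train_texts, train_labels = read_imdb_split("aclImdb/train")
test_texts, test_labels = read_imdb_split("aclImdb/test")
print(len(train_texts), len(test_texts))
```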
Data Preprocessing
BERT requires a specific input format, including input IDs and attention masks. We use the BertTokenizer provided by transformers to preprocess the data.
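A minimal preprocessing sketch, assuming the bert-base-uncased checkpoint (which matches the training log further below) and a maximum sequence length of 256 tokens (an assumed value):

```python
from transformers import BertTokenizer

# The tokenizer must match the checkpoint used for fine-tuning below
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Truncate/pad every review to a fixed length; the tokenizer returns
# input_ids, token_type_ids, and attention_mask for each example
train_encodings = tokenizer(train_texts, truncation=True, padding="max_length", max_length=256)
test_encodings = tokenizer(test_texts, truncation=True, padding="max_length", max_length=256)
```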
Creating a Dataset Class
To work with PyTorch's DataLoader, we need to define a custom dataset class.
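A minimal sketch of such a class; the name IMDbDataset is illustrative:

```python
import torch
from torch.utils.data import Dataset

class IMDbDataset(Dataset):
    """Wraps tokenizer output and labels so DataLoader can index them."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Convert the idx-th example's encodings and label to tensors
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = IMDbDataset(train_encodings, train_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
```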
Creating Data Loaders
We create PyTorch data loaders so that data can be read in batches during training.
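A minimal sketch that also holds out part of the training data for validation; the 90/10 split and the batch size of 16 are assumed values:

```python
from torch.utils.data import DataLoader, random_split

# Hold out 10% of the training data for validation
val_size = len(train_dataset) // 10
train_size = len(train_dataset) - val_size
train_subset, val_subset = random_split(train_dataset, [train_size, val_size])

train_loader = DataLoader(train_subset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_subset, batch_size=16)
test_loader = DataLoader(test_dataset, batch_size=16)
```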
Model Definition and Training
We use the BertForSequenceClassification model provided by transformers and train it with the AdamW optimizer and a cross-entropy loss.
```
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1 completed. Average Loss: 0.3738
Epoch 2 completed. Average Loss: 0.1900
Epoch 3 completed. Average Loss: 0.0483
```
Model Evaluation
After training, we evaluate the model's performance on the validation and test sets.
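A minimal sketch of the accuracy computation, consistent with the output below; the evaluate helper is illustrative:

```python
def evaluate(loader):
    """Return classification accuracy of the fine-tuned model over a DataLoader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits
            preds = logits.argmax(dim=-1)
            correct += (preds == batch["labels"]).sum().item()
            total += batch["labels"].size(0)
    return correct / total

print(f"Validation Accuracy: {evaluate(val_loader):.4f}")
print(f"Test Accuracy: {evaluate(test_loader):.4f}")
```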
```
Validation Accuracy: 0.8908
Test Accuracy: 0.8815
```
Summary
The code above shows how to load and preprocess the IMDb dataset and then fine-tune a BERT model on it. Loading the data, preprocessing it, defining a dataset class, creating data loaders, and training and evaluating the model together make up a complete BERT fine-tuning workflow.



