AI4S Cup-IRMI tutorial | Molecular Identification from Infrared Spectra


Registration: https://nb.bohrium.dp.tech/competitions/detail/3473441128?tab=introduce

Competition task

Molecule prediction from infrared spectra: in this competition, participants are asked to implement a machine learning algorithm that infers a molecule's structure (expressed as a SMILES string) from its infrared spectrum.

Data description

This challenge targets spectrum-to-structure inversion, a problem currently receiving strong attention in both academia and industry.

  • The dataset contains roughly 130,000 molecules.
  • Training set / test set split = 8:2.
  • The input is infrared spectral data.
  • Prediction target: 10 candidate SMILES strings that match the spectrum, ranked by confidence (one data record is sketched right after this list).
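
For intuition, one record can be pictured as a small dict. The shapes below are inferred from the baseline code later in this notebook (draw_ir reads ir[:, 0] / ir[:, 1] and the model reshapes 3,000 intensity values), so treat this as an illustration rather than an official specification:

import numpy as np

# Hypothetical illustration of a single training record; shapes are inferred
# from the baseline code below, not from an official data spec.
sample = {
    "smi": "CCO",                                  # target molecule as a SMILES string
    "ir": np.zeros((3000, 2), dtype=np.float32),   # column 0: spectral axis, column 1: intensity
}
print(sample["smi"], sample["ir"].shape)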

Submission format

Participants must submit runnable notebook inference code; when executed, it must successfully generate a submission.csv file at the specified path.

Image to use

ai4s-cup-0.1

Recommended GPU

16 GB V100



Baselines

[1]
! ls /data
! pip install lmdb
simple_baseline
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: lmdb in /opt/conda/lib/python3.8/site-packages (1.3.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Competition data

This competition stores its data in LMDB files. The dataset class below is provided to help participants read the training data.

[2]
import os
import lmdb
import torch
import pickle
from rdkit import Chem
from functools import lru_cache
from torch.utils.data import Dataset


def get_canonical_smile(testsmi):
    try:
        mol = Chem.MolFromSmiles(testsmi)
        return Chem.MolToSmiles(mol)
    except Exception:
        print("Cannot convert {} to canonical smiles".format(testsmi))
        return testsmi


class IRDataset(Dataset):
    """Reads the competition LMDB file; each record is a dict with "smi" and "ir"."""

    def __init__(self, db_path):
        self.db_path = db_path
        assert os.path.isfile(self.db_path), "{} not found".format(self.db_path)
        env = self.connect_db(self.db_path)
        with env.begin() as txn:
            self._keys = list(txn.cursor().iternext(values=False))

    def connect_db(self, lmdb_path, save_to_self=False):
        env = lmdb.open(
            lmdb_path,
            subdir=False,
            readonly=True,
            lock=False,
            readahead=False,
            meminit=False,
            max_readers=256,
        )
        if not save_to_self:
            return env
        else:
            self.env = env

    def __len__(self):
        return len(self._keys)

    @lru_cache(maxsize=16)
    def __getitem__(self, idx):
        if not hasattr(self, "env"):
            self.connect_db(self.db_path, save_to_self=True)
        key = self._keys[idx]
        datapoint_pickled = self.env.begin().get(key)
        data = pickle.loads(datapoint_pickled)
        if "smi" in data.keys():
            data["smi"] = get_canonical_smile(data["smi"])
        return data
/opt/conda/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Each training sample is a dict containing two keys, "smi" and "ir": "ir" is the infrared spectrum that serves as the model input, and "smi" is the molecule to be predicted, given as a string.

Below is an example that prints the first training sample of the dataset and plots its infrared spectrum as a curve, to give an intuitive feel for the data.

[3]
import matplotlib.pyplot as plt


def draw_ir(ir):
    # ir has shape (n_points, 2): column 0 is the spectral axis, column 1 the intensity
    x = ir[:, 0]
    y = ir[:, 1]
    plt.plot(x, y)
    plt.xlabel('X axis')
    plt.ylabel('Y axis')
    plt.title('Curve Plot')
    plt.savefig('curve_plot.png')
    plt.show()


train_db_path = "/bohr/AI4SCUP-IRMI-baseline-shdv/v2/train.small.lmdb"
train_datast = IRDataset(train_db_path)
print("The first training sample contains:")
print(train_datast[0])
draw_ir(train_datast[0]["ir"])

Training the model

Below, as a modest starting point for your own ideas, we build a simple, lightly modified Transformer model for training.

Custom collator

[4]
class MyCollator(object):
    def __init__(self, **kwargs):
        self.tokenizer = (
            kwargs.pop("tokenizer") if "tokenizer" in kwargs.keys() else None
        )
        assert self.tokenizer is not None
        self.max_length = (
            kwargs.pop("max_length") if "max_length" in kwargs.keys() else 512
        )

    def __call__(self, examples):
        input = {}
        smi = []
        ir = []
        for i in examples:
            if "smi" in i.keys():
                smi.append(i["smi"])
            ir.append(i["ir"][:, 1])
        if len(smi) > 0:
            output = self.tokenizer(
                smi,
                padding=True,
                max_length=self.max_length,
                truncation=True,
                return_tensors="pt",
            )
            input["labels"] = output["input_ids"][:, 1:].cuda()
            input["decoder_input_ids"] = output["input_ids"][:, :-1].cuda()
        input["ir"] = torch.tensor(ir, dtype=torch.float32).cuda()
        return input

Define some basic model layers

[5]
import json
import math
import torch
from torch import nn
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig


class Linear(nn.Linear):
    def __init__(
        self,
        d_in: int,
        d_out: int,
        bias: bool = True,
        init: str = "default",
    ):
        super(Linear, self).__init__(d_in, d_out, bias=bias)

        self.use_bias = bias

        if self.use_bias:
            with torch.no_grad():
                self.bias.fill_(0)

        if init == "default":
            self._trunc_normal_init(1.0)
        elif init == "relu":
            self._trunc_normal_init(2.0)
        elif init == "glorot":
            self._glorot_uniform_init()
        elif init == "gating":
            self._zero_init(self.use_bias)
        elif init == "normal":
            self._normal_init()
        elif init == "final":
            self._zero_init(False)
        elif init == "jax":
            self._jax_init()
        else:
            raise ValueError("Invalid init method.")

    def _trunc_normal_init(self, scale=1.0):
        # Constant from scipy.stats.truncnorm.std(a=-2, b=2, loc=0., scale=1.)
        TRUNCATED_NORMAL_STDDEV_FACTOR = 0.87962566103423978
        _, fan_in = self.weight.shape
        scale = scale / max(1, fan_in)
        std = (scale**0.5) / TRUNCATED_NORMAL_STDDEV_FACTOR
        nn.init.trunc_normal_(self.weight, mean=0.0, std=std)

    def _glorot_uniform_init(self):
        nn.init.xavier_uniform_(self.weight, gain=1)

    def _zero_init(self, use_bias=True):
        with torch.no_grad():
            self.weight.fill_(0.0)
        if use_bias:
            with torch.no_grad():
                self.bias.fill_(1.0)

    def _normal_init(self):
        torch.nn.init.kaiming_normal_(self.weight, nonlinearity="linear")

    def _jax_init(self):
        input_size = self.weight.shape[-1]
        std = math.sqrt(1 / input_size)
        nn.init.trunc_normal_(self.weight, std=std, a=-2.0 * std, b=2.0 * std)


class MLP(nn.Module):
    def __init__(
        self,
        d_in,
        n_layers,
        d_hidden,
        d_out,
        activation=nn.ReLU(),
        bias=True,
        final_init="final",
    ):
        super(MLP, self).__init__()
        layers = [Linear(d_in, d_hidden, bias), activation]
        for _ in range(n_layers):
            layers += [Linear(d_hidden, d_hidden, bias), activation]
        layers.append(Linear(d_hidden, d_out, bias, init=final_init))
        self.main = nn.Sequential(*layers)

    def forward(self, x):
        return self.main(x)

Model definition

Intuitively, a Transformer that takes the IR spectrum as input and outputs SMILES should basically be able to solve this problem.

However, since the IR input is a tensor of length 3,000, we first need a few MLP layers to turn it into encoder embeddings; a minimal shape sketch follows below.
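
As a standalone sketch of that shape flow (the sizes match the baseline's ir_forward below; the hidden width of 512 and the toy batch size are only illustrative):

import torch
from torch import nn

batch = 2
ir = torch.randn(batch, 3000)                    # raw spectrum intensities
tokens = ir.reshape(batch, 50, 60)               # view the spectrum as 50 "tokens" of 60 values each
proj = nn.Sequential(nn.Linear(60, 512), nn.ReLU(), nn.Linear(512, 768))
embeds = proj(tokens)                            # (batch, 50, 768), usable as inputs_embeds for BART
print(embeds.shape)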

[6]
class MolecularGenerator(nn.Module):
    def __init__(self, config_json_path, tokenizer_path):
        super().__init__()
        with open(config_json_path, "r") as f:
            self.model = BartForConditionalGeneration(
                config=BartConfig(**json.loads(f.read()))
            )
        self.tokenizer = BartTokenizer.from_pretrained(tokenizer_path)
        self.mlp = MLP(60, 3, 512, 768, activation=nn.ReLU())

    def ir_forward(self, ir):
        ir = self.mlp(ir.reshape([ir.shape[0], 50, 60]))
        return ir

    def forward(self, **kwargs):
        ir = self.ir_forward(kwargs.pop("ir"))
        return self.model(inputs_embeds=ir, **kwargs)

    def infer(self, num_beams=10, num_return_sequences=None, max_length=512, **kwargs):
        ir = self.ir_forward(kwargs.pop("ir"))
        result = self.model.generate(
            max_length=max_length,
            num_beams=num_beams,
            num_return_sequences=num_beams
            if num_return_sequences is None
            else num_return_sequences,
            inputs_embeds=ir,
            decoder_start_token_id=0,
        )
        smiles = [
            self.tokenizer.decode(i).replace("<pad>", "").replace("<s>", "").replace("</s>", "")
            for i in result
        ]
        return smiles

    def load_weights(self, path):
        if path is not None:
            model_dict = torch.load(path, map_location=torch.device("cpu"))
            self.load_state_dict(model_dict)
代码
文本

Training

Since this is only a demonstration, to avoid wasting compute we train for just 20 steps on the small dataset.

[7]
import math
import os
import torch
import time
from torch.utils.data import DataLoader
from transformers import (
    AdamW,
    SchedulerType,
    get_scheduler,
    set_seed,
)


def main(
    model_path=None,
    config_json_path="/bohr/AI4SCUP-IRMI-baseline-shdv/v2/simple_baseline/configs/bart.json",
    tokenizer_path="/bohr/AI4SCUP-IRMI-baseline-shdv/v2/simple_baseline/tokenizer-smiles-bart/",
    model_weight=None,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    weight_decay=0,
    num_train_epochs=None,
    max_train_steps=20,
    gradient_accumulation_steps=1,
    lr_scheduler_type="linear",
    num_warmup_epochs=0,
    output_dir="/data",
    seed=42,
    block_size=512,
):
    if output_dir is not None:
        os.makedirs(output_dir, exist_ok=True)
    if seed is not None:
        set_seed(seed)

    model = MolecularGenerator(
        config_json_path=config_json_path,
        tokenizer_path=tokenizer_path,
    )
    model.load_weights(model_weight)
    model = model.cuda()
    train_dataloader = DataLoader(
        train_datast,
        shuffle=True,
        collate_fn=MyCollator(tokenizer=model.tokenizer, max_length=block_size),
        batch_size=per_device_train_batch_size,
    )

    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": weight_decay,
        },
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)

    model.train()

    num_update_steps_per_epoch = math.ceil(
        len(train_dataloader) / gradient_accumulation_steps
    )
    if max_train_steps is None:
        max_train_steps = num_train_epochs * num_update_steps_per_epoch
    else:
        num_train_epochs = math.ceil(
            max_train_steps / num_update_steps_per_epoch
        )

    lr_scheduler = get_scheduler(
        name=lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=num_warmup_epochs * len(train_dataloader),
        num_training_steps=max_train_steps,
    )
    start_epoch = 0
    completed_steps = 0

    for epoch in range(start_epoch, num_train_epochs):
        train_loss_sum = 0.0
        start = time.time()
        for step, batch in enumerate(train_dataloader):
            outputs = model(**batch)
            loss = outputs.loss
            loss = loss / gradient_accumulation_steps
            loss.backward()
            train_loss_sum += loss.item()
            if (
                step % gradient_accumulation_steps == 0
                or step == len(train_dataloader) - 1
            ):
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
                completed_steps += 1

            if completed_steps >= max_train_steps:
                break
            print_step = 1
            if (step + 1) % print_step == 0:
                print("Epoch {:04d} | Step {:04d}/{:04d} | Loss {:.4f} | Time {:.4f}".format(
                    epoch + 1,
                    step + 1,
                    len(train_dataloader),
                    train_loss_sum / (step + 1),
                    time.time() - start,
                ))
                print("Learning rate = {}".format(
                    optimizer.state_dict()["param_groups"][0]["lr"]
                ))
        torch.save(model.state_dict(), os.path.join(output_dir, "final_{}.pt".format(epoch)))


main()
/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
<ipython-input-4-deda29c3cfdf>:29: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
  input["ir"] = torch.tensor(ir, dtype=torch.float32).cuda()
Epoch 0001 | Step 0001/0025 | Loss 5.8578 | Time 2.0328
Learning rate = 4.75e-05
Epoch 0001 | Step 0002/0025 | Loss 4.8940 | Time 2.4123
Learning rate = 4.5e-05
Epoch 0001 | Step 0003/0025 | Loss 4.4027 | Time 2.7918
Learning rate = 4.25e-05
Epoch 0001 | Step 0004/0025 | Loss 4.0437 | Time 3.1705
Learning rate = 4e-05
Epoch 0001 | Step 0005/0025 | Loss 3.8260 | Time 3.5467
Learning rate = 3.7500000000000003e-05
Epoch 0001 | Step 0006/0025 | Loss 3.6969 | Time 3.9215
Learning rate = 3.5e-05
Epoch 0001 | Step 0007/0025 | Loss 3.5683 | Time 4.3173
Learning rate = 3.2500000000000004e-05
Epoch 0001 | Step 0008/0025 | Loss 3.5376 | Time 4.7043
Learning rate = 3e-05
Epoch 0001 | Step 0009/0025 | Loss 3.4501 | Time 5.0899
Learning rate = 2.7500000000000004e-05
Epoch 0001 | Step 0010/0025 | Loss 3.3715 | Time 5.4703
Learning rate = 2.5e-05
Epoch 0001 | Step 0011/0025 | Loss 3.3231 | Time 5.7715
Learning rate = 2.25e-05
Epoch 0001 | Step 0012/0025 | Loss 3.3309 | Time 6.1494
Learning rate = 2e-05
Epoch 0001 | Step 0013/0025 | Loss 3.2512 | Time 6.4937
Learning rate = 1.75e-05
Epoch 0001 | Step 0014/0025 | Loss 3.1804 | Time 6.8662
Learning rate = 1.5e-05
Epoch 0001 | Step 0015/0025 | Loss 3.1801 | Time 7.2430
Learning rate = 1.25e-05
Epoch 0001 | Step 0016/0025 | Loss 3.1528 | Time 7.6188
Learning rate = 1e-05
Epoch 0001 | Step 0017/0025 | Loss 3.1445 | Time 7.9922
Learning rate = 7.5e-06
Epoch 0001 | Step 0018/0025 | Loss 3.1076 | Time 8.3664
Learning rate = 5e-06
Epoch 0001 | Step 0019/0025 | Loss 3.0716 | Time 8.7404
Learning rate = 2.5e-06

Generating the submission file

The output is a submission.csv file with the following format:

The first line is:

index,rank1,rank2,rank3,rank4,rank5,rank6,rank7,rank8,rank9,rank10

Each subsequent line contains one result, with the index counting from 0.

The output looks like this:

index,rank1,rank2,rank3,rank4,rank5,rank6,rank7,rank8,rank9,rank10
0,C,C,C,C,C,C,C,C,C,C
1,C,C,C,C,C,C,C,C,C,C
...

For reference, the predictions generated by a model trained with the code above for 20 epochs on the full dataset are provided at: /bohr/AI4SCUP-IRMI-baseline-shdv/v2/simple_baseline/output/submission.csv

The code needed to generate the output is shown below; to keep it quick, it only runs 2 batches on the small dataset:

[8]
import torch
import csv
from tqdm import tqdm
from torch.utils.data import DataLoader


def test(test_data_path, weights_path, batch_size, output_csv):
    model = MolecularGenerator(
        config_json_path="/bohr/AI4SCUP-IRMI-baseline-shdv/v2/simple_baseline/configs/bart.json",
        tokenizer_path="/bohr/AI4SCUP-IRMI-baseline-shdv/v2/simple_baseline/tokenizer-smiles-bart/",
    )
    model.load_weights(weights_path)
    model.cuda()
    model.eval()

    test_data = DataLoader(
        IRDataset(
            db_path=test_data_path,
        ),
        shuffle=False,
        collate_fn=MyCollator(tokenizer=model.tokenizer, max_length=512),
        batch_size=batch_size,
        drop_last=False,
    )
    with open(output_csv, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerow(['index', 'rank1', 'rank2', 'rank3', 'rank4', 'rank5', 'rank6', 'rank7', 'rank8', 'rank9', 'rank10'])
        with torch.no_grad():
            for idx, i in tqdm(enumerate(test_data)):
                result = model.infer(
                    tokenizer=model.tokenizer, length_penalty=0, num_beams=10, **i
                )
                # result is a flat list of 10 beam candidates per sample; write one
                # row per sample, with a global running index starting from 0
                for j in range(0, len(result), 10):
                    csv_writer.writerow([idx * batch_size + j // 10] + result[j:j + 10])
                if idx > 1:
                    break


test(
    "/bohr/AI4SCUP-IRMI-baseline-shdv/v2/test.small.lmdb",
    "/bohr/AI4SCUP-IRMI-baseline-shdv/v2/simple_baseline/output/final_20.pt",
    16,
    "submission.csv"
)
2it [01:14, 37.37s/it]
[9]
! ls /
curve_plot.png	final_0.pt  simple_baseline  submission.csv

Evaluation metric

  • This challenge uses a top-k metric: for each infrared spectrum, k candidate SMILES strings are submitted in ranked order; k is set to 10 in this competition.

  • Across all results on the test set, we measure the percentage of samples whose ground-truth SMILES is hit within the top-1/top-3/top-5/top-10 candidates.

(The final score is the weighted combination 0.4 * top-1 + 0.1 * top-3 + 0.1 * top-5 + 0.4 * top-10; a scoring sketch follows below.)
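
The weighting above can be written as a small scoring helper. The sketch below is our own illustration of the formula (the official evaluation script may differ): hit_ranks holds, for each test spectrum, the 1-based rank of the first matching candidate, or None if none of the 10 candidates match.

def weighted_topk_score(hit_ranks):
    # Fraction of spectra whose ground truth is hit within the top k candidates.
    n = len(hit_ranks)
    def top_k(k):
        return sum(1 for r in hit_ranks if r is not None and r <= k) / n
    return 0.4 * top_k(1) + 0.1 * top_k(3) + 0.1 * top_k(5) + 0.4 * top_k(10)


# Example: three spectra, hit at rank 1, hit at rank 4, no hit in the top 10.
print(weighted_topk_score([1, 4, None]))  # ~0.5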

Evaluation code

Below is the code for judging whether two generated SMILES strings denote the same molecule; this is the criterion we will use for model evaluation.

[ ]
from rdkit import Chem
import warnings

warnings.filterwarnings(action="ignore")


def get_InchiKey(smi):
    if not smi:
        return None
    try:
        mol = Chem.MolFromSmiles(smi)
    except:
        return None
    if mol is None:
        return None
    try:
        key = Chem.MolToInchiKey(mol)
        return key
    except:
        return None


def judge_InchiKey(key1, key2):
    if key1 is None or key2 is None:
        return False
    return key1 == key2


def same_smi(smi1, smi2):
    key1 = get_InchiKey(smi1)
    if key1 is None:
        return False
    key2 = get_InchiKey(smi2)
    if key2 is None:
        return False
    return judge_InchiKey(key1, key2)
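
For example, using same_smi from the cell above, the hit rank for a single spectrum's candidate list could be computed as follows (the ground truth and candidates here are made up for illustration; weighted_topk_score refers to the sketch in the metric section above):

ground_truth = "CCO"                      # hypothetical ground-truth SMILES
candidates = ["CC", "OCC", "C=O"]         # hypothetical ranked predictions (rank1..rank3)
hit_rank = next(
    (rank for rank, smi in enumerate(candidates, start=1) if same_smi(ground_truth, smi)),
    None,
)
print(hit_rank)  # 2: "OCC" and "CCO" are both ethanol, so they share an InChIKey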