AI4S Cup - LLM Challenge - Extracting a "Gene-Disease-Drug" Knowledge Graph with Large Models - Rank 5 Solution - [小白Lan]
AI4S
AI4SCUP-LLMKG
小白Lan
Published 2024-04-22
Recommended image: Third-party software:ai4s-cup-metrics:0.3
Recommended machine type: c2_m4_cpu
hong(v1)
BioMistral-7B-DARE(v1)

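Note: a notebook submitted to the competition must generate a submission.jsonl file in the current directory when it is run; otherwise scoring will be affected.

0 Loading the test set

Mount the Bohrium dataset for this competition and the fine-tuned baseline model.
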
[1]
import os
# For ease of evaluation, it is recommended to specify this file path
DATA_PATH = os.getenv('DATA_PATH')
# If DATA_PATH is not set, fall back to a default value and print a warning
if not DATA_PATH:
    DATA_PATH = '/bohr/AGAC-GDA-0ifh/v8/'
    print("Warning: DATA_PATH environment variable is not set. Using default path:", DATA_PATH)

Both datasets can be downloaded from this notebook's reading page.

1 Loading the fine-tuned model

1.1 Installing and importing dependencies

[2]
# ! pip install transformers datasets peft accelerate bitsandbytes safetensors
[3]
import signal
from contextlib import contextmanager

class TimeoutException(Exception):
    pass

@contextmanager
def time_limit(seconds):
    # Raise TimeoutException if the wrapped block runs longer than `seconds`.
    # SIGALRM is only available on Unix and only works in the main thread.
    def signal_handler(signum, frame):
        raise TimeoutException("Time limit exceeded")
    original_signal_handler = signal.signal(signal.SIGALRM, signal_handler)
    try:
        signal.alarm(seconds)
        yield
    finally:
        # Cancel any pending alarm and restore the previous handler
        signal.alarm(0)
        signal.signal(signal.SIGALRM, original_signal_handler)

def long_running_func():
    import time
    time.sleep(3)
    return "This will never be returned"

# Quick self-test: the 3-second sleep exceeds the 2-second limit
try:
    with time_limit(2):
        result = long_running_func()
        print(f"Result: {result}")
except TimeoutException as e:
    print(f"Error: {e}")
[4]
import os, sys
import torch
import datasets
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
    GenerationConfig
)
from peft import PeftModel, LoraConfig, prepare_model_for_kbit_training, get_peft_model

1.2 Load the BioMistral-7B-DARE base model in 4-bit (QLoRA) for use with PEFT

[5]
# model path and weight
model_path = "/bohr/BioMistral-7B-DARE-hjn7/v1"
peft_path = "/bohr/BioMistral-7B-DARE-quanzhong-jzeo/v1/checkpoint-6270"

max_length = 2500
device_map = "auto"
batch_size = 128
micro_batch_size = 32
gradient_accumulation_steps = batch_size // micro_batch_size

# "nf4" is the 4-bit NormalFloat data type introduced with QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# loading model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto"
    # device_map=device_map
)

# # load tokenizer from huggingface
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# load tokenizer from local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


1.3 Loading the model and configuring parameters

[7]
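# loading peft weight
model = PeftModel.from_pretrained(
    model,
    peft_path,
    torch_dtype=torch.float16,
)
model.eval()

# generation config
# Note: this config is not passed to model.generate() in the inference function
# (the argument is commented out there), so these values currently have no effect.
generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4, # beam search
)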

2 Model inference

2.1 Defining the inference function

[9]
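@torch.no_grad()
def eval(prompt):
    # Note: this name shadows Python's built-in eval(); it is kept because later cells call it.
    try:
        with time_limit(90):
            inputs = tokenizer(prompt, return_tensors="pt")
            generation_output = model.generate(
                # move the prompt tokens to the device that holds the model
                input_ids=inputs.input_ids.to(model.device),
                # generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=1024, repetition_penalty=1.05, # 1.15 to 1.05
            )
            return tokenizer.decode(generation_output.sequences[0]) #, skip_special_tokens=True
    except TimeoutException as e:
        # On timeout, fall back to the prompt itself, so no answer is extracted for this item
        return prompt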

2.2 Define Prompts for Baseline

[10]
instruction_task1 = "This is a 【Gene-Disease】 relation extraction task. Extract the following text's triplets in ternary format (GENE, FUNCTION, DISEASE). The second element indicates gene's regulation on the disease, the value should be one of COM, GOF, LOF, REG. LOF and GOF for loss or gain of function; REG for general regulatory relationship; COM for complex functional changes. Return all relations in ternary format (GENE, FUNCTION, DISEASE). Multiple triples should be formatted as '(GENE, FUNCTION, DISEASE),(GENE, FUNCTION, DISEASE),...'."
instruction_task2 = "This is a 【Chemical-Disease】 relation extraction task. Extract the following abstract's relations bilateral format (CHEMICAL, DISEASE). Return all relations in bilateral format (CHEMICAL, DISEASE). Multiple binaries should be formatted as '((CHEMICAL, DISEASE),(CHEMICAL, DISEASE),...)'."
instruction_task3 = "This is a 【Drug-Drug】 relation extraction task. Extract the following text's triplets in ternary format (DRUG, INTERACTION, DRUG). Return all relations in ternary format (DRUG, INTERACTION, DRUG). Multiple triples should be formatted as '(DRUG, INTERACTION, DRUG),(DRUG, INTERACTION, DRUG),...'."

### generate prompt based on template ###
prompt_template = {
    "prompt_input": \
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
    "prompt_no_input": \
    "### Instruction:\n{instruction}\n\n### Response:\n",
    "response_split": "### Response:"
}

def generate_prompt(instruction, input=None, label=None, prompt_template=prompt_template):
    return prompt_template["prompt_input"].format(instruction=instruction, input=input)

def AGAC_prompt(AGAC_content):
    return generate_prompt(instruction_task1, AGAC_content)

def CDR_prompt(CDR_content):
    return generate_prompt(instruction_task2, CDR_content)

def DDI_prompt(DDI_content):
    return generate_prompt(instruction_task3, DDI_content)
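
To see what the model actually receives, the following minimal sketch renders the input template for a short text. It is only an illustration: `render_prompt` is a hypothetical stand-in for `generate_prompt`, and the shortened instruction and sample sentence are made up rather than taken from the competition data.

```python
# Alpaca-style template, copied from the "prompt_input" entry above
PROMPT_INPUT = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"

def render_prompt(instruction: str, text: str) -> str:
    """Hypothetical stand-in for generate_prompt: fill both template slots."""
    return PROMPT_INPUT.format(instruction=instruction, input=text)

# Shortened, made-up instruction; the real ones are instruction_task1..3
demo = render_prompt(
    "Extract (CHEMICAL, DISEASE) pairs.",
    "Clotiazepam can induce acute hepatitis.",
)
print(demo)
```

The rendered prompt ends with the `### Response:` marker, which is the same marker the answer-extraction step later splits on.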

2.3 Answer extraction

[11]
import jsonlines
import os
import json
import re

Before extraction, we first split the submission.jsonl file to be filled in into three files, one each for task1, task2, and task3.

[12]
def split_jsonl_by_task(input_file_path):
    # Group the raw lines by their integer "task" field
    tasks_data = {1: [], 2: [], 3: []}
    with open(input_file_path, 'r') as file:
        for line in file:
            data = json.loads(line)
            task_value = data.get("task")
            if task_value in tasks_data:
                tasks_data[task_value].append(line)
    # Write one file per task and remember its path
    output_files = {}
    for task, data_lines in tasks_data.items():
        output_file_path = f"testA_task_{task}.jsonl"
        with open(output_file_path, 'w') as file:
            for line in data_lines:
                file.write(line)
        output_files[task] = output_file_path
    return output_files

# os.path.join avoids a doubled slash when DATA_PATH already ends with '/'
split_jsonl_result = split_jsonl_by_task(os.path.join(DATA_PATH, 'submission.jsonl'))
split_jsonl_result
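
The grouping step can be checked without touching the file system. The sketch below applies the same idea to a few made-up records; both the records and the helper name `group_lines_by_task` are hypothetical and only illustrate the logic.

```python
import json

def group_lines_by_task(lines):
    """Group raw JSONL lines by their integer "task" field (1, 2 or 3)."""
    groups = {1: [], 2: [], 3: []}
    for line in lines:
        task = json.loads(line).get("task")
        if task in groups:  # lines with a missing or unknown task are skipped
            groups[task].append(line)
    return groups

# Made-up records, not taken from the competition data
sample = [
    '{"task": 1, "text": "a"}',
    '{"task": 3, "text": "b"}',
    '{"task": 1, "text": "c"}',
    '{"task": 9, "text": "ignored"}',
]
groups = group_lines_by_task(sample)
print({task: len(lines) for task, lines in groups.items()})  # {1: 2, 2: 0, 3: 1}
```

Records whose task value is not one of the integers 1, 2, or 3 are dropped silently, so a file that stored the task as a string such as "1" would produce three empty outputs.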

First, isolate the LLM's output.

[ ]
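def extract(output_str):
    # Keep only the text after the last "### Response:" marker
    output_str = output_str.split('### Response:')[-1]
    # Start at the first newline; if there is none, start at the beginning
    start_index = max(output_str.find("\n"), 0)
    # Stop at the end-of-sequence token; if it is missing, keep the whole tail
    end_index = output_str.find("</s>")
    if end_index == -1:
        end_index = len(output_str)
    triples_string = output_str[start_index:end_index].strip()
    return triples_string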
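
As a worked example of this step, the sketch below runs a simplified version of the same slicing on made-up strings. `extract_response` is a hypothetical helper, and the decoded text is invented for illustration rather than real model output.

```python
def extract_response(decoded: str) -> str:
    """Return the text between the last '### Response:' marker and '</s>'."""
    tail = decoded.split("### Response:")[-1]
    end = tail.find("</s>")
    if end != -1:  # cut at the end-of-sequence token when it is present
        tail = tail[:end]
    return tail.strip()

# Made-up decoded output: the prompt is echoed back, followed by the answer
decoded = "<s> ### Instruction:\n...\n\n### Response:\n(BSCL2, GOF, Silver syndrome)</s>"
answer = extract_response(decoded)
print(answer)  # (BSCL2, GOF, Silver syndrome)

# On timeout the inference function returns the bare prompt, so nothing is extracted
empty = extract_response("### Instruction:\n...\n\n### Response:\n")
print(repr(empty))  # ''
```

The second call shows why a timed-out item ends up with an empty prediction: the returned prompt contains nothing after the response marker.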

Next, normalize the string output by the LLM. Outputs made of pairs and outputs made of triples need to be handled separately.

[ ]
def process_string(text):
    text = text.replace('\n', ',')
    # Use a regular expression to find the triples
    pattern = r'\(([^(),]+),\s*([^(),]+),\s*([^(),]+)\)'
    matches = re.findall(pattern, text)
    # Remove duplicates (note: a set does not preserve the original order)
    unique_matches = list(set(matches))
    # Format as a list of strings
    formatted_matches = [f'({", ".join(match)})' for match in unique_matches]
    return ', '.join(formatted_matches)
[ ]
def process_string_2(text):
    text = text.replace('\n', ',')
    # Use a regular expression to find the pairs
    pattern = r'\(([^(),]+),\s*([^(),]+)\)'
    matches = re.findall(pattern, text)
    # Remove duplicates (note: a set does not preserve the original order)
    unique_matches = list(set(matches))
    # Format as a list of strings
    formatted_matches = [f'({", ".join(match)})' for match in unique_matches]
    return ', '.join(formatted_matches)
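
The two patterns are strict about arity: the triple pattern never matches a pair, and the pair pattern never matches a triple. The short sketch below shows this, together with the deduplication, on a made-up output string (the drug names are illustrative only).

```python
import re

# Same regular expressions as in process_string and process_string_2
TRIPLE = re.compile(r'\(([^(),]+),\s*([^(),]+),\s*([^(),]+)\)')
PAIR = re.compile(r'\(([^(),]+),\s*([^(),]+)\)')

# Made-up output: one triple repeated twice, then a pair on a new line
raw = "(aspirin, DDI-effect, warfarin),(aspirin, DDI-effect, warfarin)\n(a, b)"
text = raw.replace("\n", ",")

triples = TRIPLE.findall(text)
pairs = PAIR.findall(text)

print(len(triples), sorted(set(triples)))  # 2 [('aspirin', 'DDI-effect', 'warfarin')]
print(pairs)                               # [('a', 'b')]
```

Because each field excludes commas and parentheses, a tuple with the wrong number of fields is skipped entirely rather than truncated.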

Then, define the functions that write the processed results for tasks 1 to 3.

[ ]
def task1_write(input_file, output_file):
    with jsonlines.open(input_file, 'r') as reader, \
         jsonlines.open(output_file, 'a') as writer:
        t = 0
        for item in reader:
            t = t+1
            text = item["text"]
            text = text.replace('\n', '')  # remove all \n line breaks from the text
            output = eval(AGAC_prompt(text))
            try:
                processed_output = process_string(extract(output))
                print(f"task1: {t} outputs are processed.")
            except Exception:
                processed_output = ""
                print(f"task1: {t} outputs can't be processed.")
            item['ideal']["GENE, FUNCTION, DISEASE"] = processed_output
            writer.write(item)
[ ]
def task2_write(input_file, output_file):
    with jsonlines.open(input_file, 'r') as reader, \
         jsonlines.open(output_file, 'a') as writer:
        t = 0
        for item in reader:
            t = t+1
            text = item['abstract']
            text = text.replace('\n', '')  # remove all \n line breaks from the text
            output = eval(CDR_prompt(text))
            try:
                processed_output = process_string_2(extract(output))
                print(f"task2: {t} outputs are processed.")
            except Exception:
                processed_output = ""
                print(f"task2: {t} outputs can't be processed.")
            item['ideal']["chemical, disease"] = processed_output
            writer.write(item)
[ ]
def task3_write(input_file, output_file):
    with jsonlines.open(input_file, 'r') as reader, \
         jsonlines.open(output_file, 'a') as writer:
        t = 0
        for item in reader:
            t = t+1
            text = item['text']
            text = text.replace('\n', '')  # remove all \n line breaks from the text
            output = eval(DDI_prompt(text))
            try:
                processed_output = process_string(extract(output))
                print(f"task3: {t} outputs are processed.")
            except Exception:
                processed_output = ""
                print(f"task3: {t} outputs can't be processed.")
            item['ideal']['DDI-triples'] = processed_output
            writer.write(item)

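2.4 Small-scale test

Measured machine utilization while running a single inference task in a single thread:

[Figure: machine utilization during single-threaded inference]

2.4.1 Task 1
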
[ ]
text1 = eval(AGAC_prompt("N88S mutation in the BSCL2 gene in a Serbian family with distal hereditary motor neuropathy type V or Silver syndrome. BACKGROUND: Distal hereditary motor neuropathy type V (dHMN-V) and Silver syndrome are rare phenotypically overlapping diseases which can be caused by mutations in the Berardinelli-Seip Congenital Lipodystrophy 2 (BSCL2) gene or Seipin. AIM: To report the first Serbian family with a BSCL2 mutation showing variable expression within the family. PATIENTS AND METHODS: A 55-year-old woman presented with weakness of both hands at the age of 45. At age 47, she noticed distal muscle weakness and atrophy in her legs. Physical examination revealed atrophy and weakness of small hand muscles and mild atrophy and weakness of the lower limbs. There was generalized hyperreflexia with the exception of ankle reflexes which were diminished. Her 25year-old son had only stiffness of both legs at the age of 22. Physical examination revealed only generalized hyporeflexia. The third affected member in this family was her 55year-old cousin who showed a more prominent involvement of leg muscles with mild asymmetrical weakness of hand muscles and no pyramidal tract features. RESULTS: In all three patients sensory nerve conduction velocities (NCV) were normal in all extremities. Compound muscle action potential (CMAP) amplitudes were markedly reduced in all patients. Concentric needle EMG showed evidence of chronic denervation in distal muscles. DNA sequencing of BSCL2 was performed and a heterozygous N88S missense mutation in BSCL2 gene was detected in all three patients. CONCLUSION: This report is further confirmation of phenotypic heterogenity due to the N88S mutation of BSCL2 gene in the same family."))
text1
[ ]
processed_text1 = extract(text1)
processed_text1
[ ]
processed_string1 = process_string(processed_text1)
processed_string1

processed_string1 is empty here because the LLM extracted 4-tuples rather than triples from this passage.
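
The effect is easy to reproduce in isolation. In the sketch below, the 4-tuple string is a made-up example of such an output (not the model's actual response), and `tuple_arities` is a hypothetical diagnostic that is not part of the notebook.

```python
import re

# Same triple pattern as in process_string
TRIPLE = re.compile(r'\(([^(),]+),\s*([^(),]+),\s*([^(),]+)\)')

# Made-up model output with four fields in the tuple
four_tuple_output = "(BSCL2, N88S, GOF, Silver syndrome)"
matches = TRIPLE.findall(four_tuple_output)
print(matches)  # []

def tuple_arities(text):
    """Hypothetical diagnostic: number of comma-separated fields in each (...) group."""
    return [len(group.split(",")) for group in re.findall(r'\(([^()]*)\)', text)]

arities = tuple_arities(four_tuple_output)
print(arities)  # [4]
```

A check like this could be used to count how many outputs are lost to arity mismatches before deciding whether extra post-processing is worthwhile.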

2.4.2 Task 3

[ ]
text3 = eval(DDI_prompt("Isocarboxazid should be administered with caution to patients receiving Antabuse (disulfiram, Wyeth-Ayerst Laboratories). In a single study, rats given high intraperitoneal doses of an MAO inhibitor plus disulfiram experienced severe toxicity, including convulsions and death. Concomitant use of Isocarboxazid and other psychotropic agents is generally not recommended because of possible potentiating effects. This is especially true in patients who may subject themselves to an overdosage of drugs. If combination therapy is needed, careful consideration should be given to the pharmacology of all agents to be used. The monoamine oxidase inhibitory effects of Isocarboxazid may persist for a substantial period after discontinuation of the drug, and this should be borne in mind when another drug is prescribed following Isocarboxazid. To avoid potentiation, the physician wishing to terminate treatment with Isocarboxazid and begin therapy with another agent should allow for an interval of 10 days."))
text3
[ ]
processed_text3 = extract(text3)
processed_text3
[ ]
processed_string3 = process_string(processed_text3)
processed_string3

2.4.3 Task 2

[ ]
text2 = eval(CDR_prompt("We report the case of a patient who developed acute hepatitis with extensive hepatocellular necrosis, 7 months after the onset of administration of clotiazepam, a thienodiazepine derivative. Clotiazepam withdrawal was followed by prompt recovery. The administration of several benzodiazepines, chemically related to clotiazepam, did not interfere with recovery and did not induce any relapse of hepatitis. This observation shows that clotiazepam can induce acute hepatitis and suggests that there is no cross hepatotoxicity between clotiazepam and several benzodiazepines."))
text2
[ ]
processed_text2 = extract(text2)
processed_text2
[ ]
processed_string2 = process_string_2(processed_text2)
processed_string2

3 Start the run!

Write the processed results to the submission.jsonl file, using append mode "a".

[ ]
task1_write("testA_task_1.jsonl", "submission.jsonl")
[ ]
task2_write("testA_task_2.jsonl", "submission.jsonl")
[ ]
task3_write("testA_task_3.jsonl", "submission.jsonl")

Note: the results must be merged in the order and quantity of 25 entries for Task 1, 224 for Task 2, and 50 for Task 3; otherwise scoring will be affected.

Note: the submitted file must be named "submission.jsonl"!
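
Since the file is written in append mode, re-running a cell adds duplicate records, so it can be worth checking the layout before submitting. The following is a minimal sketch of such a check, assuming every record keeps the integer "task" field used in the split step; `task_runs` and `check_submission` are hypothetical helpers, not part of the competition tooling.

```python
import json
from itertools import groupby

EXPECTED = [(1, 25), (2, 224), (3, 50)]  # (task, count) in the required order

def task_runs(lines):
    """Collapse consecutive records with the same task into (task, count) runs."""
    tasks = [json.loads(line)["task"] for line in lines if line.strip()]
    return [(task, len(list(run))) for task, run in groupby(tasks)]

def check_submission(path="submission.jsonl"):
    """Raise ValueError if the file does not follow the expected layout."""
    with open(path, encoding="utf-8") as f:
        runs = task_runs(f)
    if runs != EXPECTED:
        raise ValueError(f"unexpected layout {runs}, expected {EXPECTED}")
    return True

# Self-check on synthetic lines rather than a real file
fake = ['{"task": 1}'] * 25 + ['{"task": 2}'] * 224 + ['{"task": 3}'] * 50
print(task_runs(fake) == EXPECTED)  # True
```

If a writer cell was run twice, the check reports the doubled run; deleting submission.jsonl and running the three writer cells once each, in order, restores the expected layout.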