AI4S Cup - LLM Challenge - Extracting a "Gene-Disease-Drug" Knowledge Graph with Large Models - Rank 5 Solution - [小白Lan]

AI4S

AI4SCUP-LLMKG

小白Lan

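Published 2024-04-22

Recommended image: Third-party software:ai4s-cup-metrics:0.3

Recommended machine type: c2_m4_cpu

Datasets

hong(v1)

BioMistral-7B-DARE(v1)

Note: a notebook submitted to the competition must generate a submission.jsonl file in the current directory when run; otherwise scoring will be affected.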

0 Load the test set

Mount the Bohr dataset for this competition, together with the fine-tuned baseline model.

[1]

import os

# For ease of evaluation, specify this file path explicitly
DATA_PATH = os.getenv('DATA_PATH')

# If DATA_PATH is not set, assign a default value and print a warning
if not DATA_PATH:
    DATA_PATH = '/bohr/AGAC-GDA-0ifh/v8/'
    print("Warning: DATA_PATH environment variable is not set. Using default path:", DATA_PATH)

Both datasets can be downloaded from this notebook's reading page.

1 Loading the fine-tuned model

1.1 Installing and importing dependencies

[2]

# ! pip install transformers datasets peft accelerate bitsandbytes safetensors

[3]

import signal
from contextlib import contextmanager


class TimeoutException(Exception):
    pass


@contextmanager
def time_limit(seconds):
    # SIGALRM-based timeout: available on Unix only, and only in the main thread
    def signal_handler(signum, frame):
        raise TimeoutException("Time limit exceeded")

    # Install our handler and remember the previous one so it can be restored
    original_signal_handler = signal.signal(signal.SIGALRM, signal_handler)
    try:
        signal.alarm(seconds)
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, original_signal_handler)


# Quick self-test: a 3-second call under a 2-second limit should time out
def long_running_func():
    import time
    time.sleep(3)
    return "This will never be returned"


try:
    with time_limit(2):
        result = long_running_func()
        print(f"Result: {result}")
except TimeoutException as e:
    print(f"Error: {e}")
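
The time limit above relies on signal.alarm, which only accepts whole seconds. As a minimal sketch, assuming a Unix system and code running in the main thread, the same pattern can be written with signal.setitimer to allow fractional limits. The names Timeout and time_limit_s are hypothetical and are not part of the original notebook.

```python
import signal
import time
from contextlib import contextmanager


class Timeout(Exception):
    """Raised when the guarded block exceeds its time budget."""


@contextmanager
def time_limit_s(seconds: float):
    """Variant of a SIGALRM timeout that accepts fractional seconds."""
    def handler(signum, frame):
        raise Timeout(f"exceeded {seconds} s")

    previous = signal.signal(signal.SIGALRM, handler)
    try:
        signal.setitimer(signal.ITIMER_REAL, seconds)  # arm a one-shot timer
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)        # disarm the timer
        signal.signal(signal.SIGALRM, previous)        # restore the old handler


timed_out = False
try:
    with time_limit_s(0.2):
        time.sleep(1.0)   # stands in for a slow generate() call
except Timeout:
    timed_out = True
print(timed_out)
```

Restoring the previous handler in the finally block keeps nested or repeated use from leaving a stale handler installed.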

[4]

import os, sys
import torch
import datasets
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
    GenerationConfig
)
from peft import PeftModel, LoraConfig, prepare_model_for_kbit_training, get_peft_model

1.2 Load the base model with 4-bit quantization (QLoRA)

[5]

# base model path and adapter (PEFT) checkpoint path
model_path = "/bohr/BioMistral-7B-DARE-hjn7/v1"
peft_path = "/bohr/BioMistral-7B-DARE-quanzhong-jzeo/v1/checkpoint-6270"

max_length = 2500
device_map = "auto"
batch_size = 128
micro_batch_size = 32
gradient_accumulation_steps = batch_size // micro_batch_size

# "nf4" selects the 4-bit NormalFloat data type; double quantization also
# quantizes the quantization constants, and computation runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# loading model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto"
    # device_map=device_map
)

# # load tokenizer from huggingface
# tokenizer = AutoTokenizer.from_pretrained(model_id)

# load tokenizer from local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
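
To give a feel for why the base model is loaded in 4-bit, here is a rough back-of-the-envelope calculation. It is only a sketch: the parameter count of 7e9 is an assumed round figure for a "7B" model, and the estimate ignores activations, the adapter weights, and quantization metadata. The last lines recompute the accumulation factor implied by the batch settings in the cell above.

```python
# ASSUMPTION: 7e9 parameters is a round illustrative figure, not the exact count.
N_PARAMS = 7_000_000_000


def weight_gib(n_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in GiB, ignoring activations and overhead."""
    return n_params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(N_PARAMS, 16)
nf4_gib = weight_gib(N_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{nf4_gib:.1f} GiB")

# Accumulation factor implied by the batch settings in the notebook cell.
batch_size, micro_batch_size = 128, 32
gradient_accumulation_steps = batch_size // micro_batch_size
print(gradient_accumulation_steps)
```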

1.3 Model loading and parameter configuration

[7]

# loading peft weight
model = PeftModel.from_pretrained(
    model,
    peft_path,
    torch_dtype=torch.float16,
)
model.eval()

# generation config
generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,  # beam search
)

2 Model inference

2.1 Defining the inference function

[9]

@torch.no_grad()
def eval(prompt):
    # Note: this name shadows Python's built-in eval() inside the notebook
    try:
        with time_limit(90):
            inputs = tokenizer(prompt, return_tensors="pt")
            generation_output = model.generate(
                input_ids=inputs.input_ids,
                # generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=1024, repetition_penalty=1.05,  # 1.15 to 1.05
            )
            return tokenizer.decode(generation_output.sequences[0])  # , skip_special_tokens=True
    except TimeoutException as e:
        # On timeout, return the bare prompt; extraction then yields an empty answer
        return prompt

2.2 Define Prompts for Baseline

[10]

instruction_task1 = "This is a 【Gene-Disease】 relation extraction task. Extract the following text's triplets in ternary format (GENE, FUNCTION, DISEASE). The second element indicates gene's regulation on the disease, the value should be one of COM, GOF, LOF, REG. LOF and GOF for loss or gain of function; REG for general regulatory relationship; COM for complex functional changes. Return all relations in ternary format (GENE, FUNCTION, DISEASE). Multiple triples should be formatted as '(GENE, FUNCTION, DISEASE),(GENE, FUNCTION, DISEASE),...'."

instruction_task2 = "This is a 【Chemical-Disease】 relation extraction task. Extract the following abstract's relations bilateral format (CHEMICAL, DISEASE). Return all relations in bilateral format (CHEMICAL, DISEASE). Multiple binaries should be formatted as '((CHEMICAL, DISEASE),(CHEMICAL, DISEASE),...)'."

instruction_task3 = "This is a 【Drug-Drug】 relation extraction task. Extract the following text's triplets in ternary format (DRUG, INTERACTION, DRUG). Return all relations in ternary format (DRUG, INTERACTION, DRUG). Multiple triples should be formatted as '(DRUG, INTERACTION, DRUG),(DRUG, INTERACTION, DRUG),...'."

### generate prompt based on template ###
prompt_template = {
    "prompt_input":
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
    "prompt_no_input":
        "### Instruction:\n{instruction}\n\n### Response:\n",
    "response_split": "### Response:"
}


def generate_prompt(instruction, input=None, label=None, prompt_template=prompt_template):
    return prompt_template["prompt_input"].format(instruction=instruction, input=input)


def AGAC_prompt(AGAC_content):
    return generate_prompt(instruction_task1, AGAC_content)


def CDR_prompt(CDR_content):
    return generate_prompt(instruction_task2, CDR_content)


def DDI_prompt(DDI_content):
    return generate_prompt(instruction_task3, DDI_content)
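
To make the template concrete, the following sketch fills it with a shortened instruction and a one-sentence input. Both strings are made up for illustration, and build_prompt is a hypothetical stand-in for generate_prompt so that the snippet runs on its own.

```python
# Same instruction/input/response layout as the notebook's "prompt_input" entry.
PROMPT_INPUT = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"


def build_prompt(instruction: str, text: str) -> str:
    """Fill the template with an instruction and the text to analyse."""
    return PROMPT_INPUT.format(instruction=instruction, input=text)


# Hypothetical shortened instruction and input, for illustration only.
demo = build_prompt(
    "Extract (DRUG, INTERACTION, DRUG) triples.",
    "Drug A may potentiate drug B.",
)
print(demo)
```

Because the prompt ends with the "### Response:" marker and a newline, everything the model generates after that marker can later be treated as the answer.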

2.3 Answer extraction

[11]

import jsonlines
import os
import json
import re

Before extraction, we first split the submission.jsonl file to be filled in into three separate files, one each for task1, task2, and task3.

[12]

def split_jsonl_by_task(input_file_path):
    # Collect the raw lines of each task, keyed by the record's "task" field
    tasks_data = {1: [], 2: [], 3: []}
    with open(input_file_path, 'r') as file:
        for line in file:
            data = json.loads(line)
            task_value = data.get("task")
            if task_value in tasks_data:
                tasks_data[task_value].append(line)

    # Write one file per task and remember its path
    output_files = {}
    for task, data_lines in tasks_data.items():
        output_file_path = f"testA_task_{task}.jsonl"
        with open(output_file_path, 'w') as file:
            for line in data_lines:
                file.write(line)
        output_files[task] = output_file_path
    return output_files


split_jsonl_result = split_jsonl_by_task(os.path.join(DATA_PATH, 'submission.jsonl'))
split_jsonl_result
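
The splitting step can be tried on a toy file before pointing it at the real data. The sketch below is a self-contained variant under stated assumptions: only the "task" key comes from the notebook, while the helper name split_by_task, the output file names, and the toy records are invented for illustration.

```python
import json
import os
import tempfile


def split_by_task(input_path, out_dir, tasks=(1, 2, 3)):
    """Group JSONL records by their "task" field and write one file per task."""
    buckets = {t: [] for t in tasks}
    with open(input_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                if record.get("task") in buckets:
                    buckets[record["task"]].append(record)
    paths = {}
    for task, records in buckets.items():
        paths[task] = os.path.join(out_dir, f"task_{task}.jsonl")
        with open(paths[task], "w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return paths


# Toy records; everything except the "task" key is made up.
toy = [{"task": 1, "text": "a"}, {"task": 3, "text": "b"}, {"task": 1, "text": "c"}]
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy.jsonl")
    with open(src, "w", encoding="utf-8") as f:
        f.writelines(json.dumps(r) + "\n" for r in toy)
    paths = split_by_task(src, tmp)
    counts = {}
    for t, p in paths.items():
        with open(p, encoding="utf-8") as f:
            counts[t] = sum(1 for _ in f)
print(counts)
```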

First, isolate the model's output.

[ ]

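def extract(output_str):
    # Keep only what follows the last "### Response:" marker
    output_str = output_str.split('### Response:')[-1]
    start_index = max(output_str.find("\n"), 0)
    end_index = output_str.find("</s>")
    if end_index == -1:
        # No end-of-sequence token (e.g. truncated generation): keep the full tail
        end_index = len(output_str)
    triples_string = output_str[start_index:end_index].strip()
    return triples_string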
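
The behaviour of this step is easiest to see on a few decoded strings. The sketch below uses a hypothetical helper, extract_response, and made-up strings. It covers three cases: a complete answer ending in "</s>", an answer with no end-of-sequence token, and the bare prompt that the inference function returns on timeout.

```python
def extract_response(decoded: str, marker: str = "### Response:", eos: str = "</s>") -> str:
    """Return the text between the response marker and the end-of-sequence token."""
    tail = decoded.split(marker)[-1]
    end = tail.find(eos)
    if end == -1:          # no EOS token: keep everything after the marker
        end = len(tail)
    return tail[:end].strip()


# Made-up decoded strings for illustration.
full = "<s> ### Instruction:\n...\n\n### Response:\n(aspirin, effect, warfarin)</s>"
truncated = "### Response:\n(aspirin, effect, warfa"
fallback = "### Instruction:\n...\n\n### Response:\n"   # bare prompt, as on timeout

print(repr(extract_response(full)))
print(repr(extract_response(truncated)))
print(repr(extract_response(fallback)))
```

Treating a missing "</s>" explicitly matters because str.find returns -1 in that case, and using -1 directly as a slice end would silently drop the last character.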

Next, normalize the string output by the model. Outputs made of pairs and outputs made of triples need to be handled separately.

[ ]

def process_string(text):
    text = text.replace('\n', ',')
    # Use a regular expression to find triples
    pattern = r'\(([^(),]+),\s*([^(),]+),\s*([^(),]+)\)'
    matches = re.findall(pattern, text)
    # Deduplicate (note: set() does not preserve the original order)
    unique_matches = list(set(matches))
    # Format as a list of strings
    formatted_matches = [f'({", ".join(match)})' for match in unique_matches]
    return ', '.join(formatted_matches)

[ ]

def process_string_2(text):
    text = text.replace('\n', ',')
    # Use a regular expression to find pairs
    pattern = r'\(([^(),]+),\s*([^(),]+)\)'
    matches = re.findall(pattern, text)
    # Deduplicate (note: set() does not preserve the original order)
    unique_matches = list(set(matches))
    # Format as a list of strings
    formatted_matches = [f'({", ".join(match)})' for match in unique_matches]
    return ', '.join(formatted_matches)
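
The two normalizers differ only in the pattern they apply, so the idea can be sketched with a single helper that takes the pattern as an argument. This is an illustrative variant, not the notebook's code: the name normalize is hypothetical, the sample strings are made up, and dict.fromkeys is swapped in for set() so that duplicates are dropped in first-seen order and the output is deterministic.

```python
import re

# Fields may not contain parentheses or commas, as in the notebook's patterns.
TRIPLE = re.compile(r'\(([^(),]+),\s*([^(),]+),\s*([^(),]+)\)')
PAIR = re.compile(r'\(([^(),]+),\s*([^(),]+)\)')


def normalize(text: str, pattern: re.Pattern) -> str:
    """Find tuples, drop duplicates in first-seen order, and re-join them."""
    matches = pattern.findall(text.replace('\n', ','))
    unique = list(dict.fromkeys(matches))   # order-preserving alternative to set()
    return ', '.join(f'({", ".join(m)})' for m in unique)


# Made-up model outputs for illustration.
triples = normalize("(BSCL2, LOF, Silver syndrome)\n(BSCL2, LOF, Silver syndrome)", TRIPLE)
pairs = normalize("((clotiazepam, hepatitis),(clotiazepam, necrosis))", PAIR)
print(triples)
print(pairs)
```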

Then, define the functions that write the processed results for tasks 1 to 3.

[ ]

def task1_write(input_file, output_file):
    with jsonlines.open(input_file, 'r') as reader, \
         jsonlines.open(output_file, 'a') as writer:
        t = 0
        for item in reader:
            t = t + 1
            text = item["text"]
            text = text.replace('\n', '')  # remove all \n line breaks from the text
            output = eval(AGAC_prompt(text))
            try:
                processed_output = process_string(extract(output))
                print(f"task1: {t} outputs are processed.")
            except Exception:
                processed_output = ""
                print(f"task1: {t} outputs can't be processed.")
            item['ideal']["GENE, FUNCTION, DISEASE"] = processed_output
            writer.write(item)

[ ]

def task2_write(input_file, output_file):
    with jsonlines.open(input_file, 'r') as reader, \
         jsonlines.open(output_file, 'a') as writer:
        t = 0
        for item in reader:
            t = t + 1
            text = item['abstract']
            text = text.replace('\n', '')  # remove all \n line breaks from the text
            output = eval(CDR_prompt(text))
            try:
                processed_output = process_string_2(extract(output))
                print(f"task2: {t} outputs are processed.")
            except Exception:
                processed_output = ""
                print(f"task2: {t} outputs can't be processed.")
            item['ideal']["chemical, disease"] = processed_output
            writer.write(item)

[ ]

def task3_write(input_file, output_file):
    with jsonlines.open(input_file, 'r') as reader, \
         jsonlines.open(output_file, 'a') as writer:
        t = 0
        for item in reader:
            t = t + 1
            text = item['text']
            text = text.replace('\n', '')  # remove all \n line breaks from the text
            output = eval(DDI_prompt(text))
            try:
                processed_output = process_string(extract(output))
                print(f"task3: {t} outputs are processed.")
            except Exception:
                processed_output = ""
                print(f"task3: {t} outputs can't be processed.")
            item['ideal']['DDI-triples'] = processed_output
            writer.write(item)

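2.4 Small-scale test

Measured machine utilization when running a single inference task in a single thread:

[Figure: machine utilization screenshot]

2.4.1 Task 1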

[ ]

text1 = eval(AGAC_prompt("N88S mutation in the BSCL2 gene in a Serbian family with distal hereditary motor neuropathy type V or Silver syndrome. BACKGROUND: Distal hereditary motor neuropathy type V (dHMN-V) and Silver syndrome are rare phenotypically overlapping diseases which can be caused by mutations in the Berardinelli-Seip Congenital Lipodystrophy 2 (BSCL2) gene or Seipin. AIM: To report the first Serbian family with a BSCL2 mutation showing variable expression within the family. PATIENTS AND METHODS: A 55-year-old woman presented with weakness of both hands at the age of 45. At age 47, she noticed distal muscle weakness and atrophy in her legs. Physical examination revealed atrophy and weakness of small hand muscles and mild atrophy and weakness of the lower limbs. There was generalized hyperreflexia with the exception of ankle reflexes which were diminished. Her 25year-old son had only stiffness of both legs at the age of 22. Physical examination revealed only generalized hyporeflexia. The third affected member in this family was her 55year-old cousin who showed a more prominent involvement of leg muscles with mild asymmetrical weakness of hand muscles and no pyramidal tract features. RESULTS: In all three patients sensory nerve conduction velocities (NCV) were normal in all extremities. Compound muscle action potential (CMAP) amplitudes were markedly reduced in all patients. Concentric needle EMG showed evidence of chronic denervation in distal muscles. DNA sequencing of BSCL2 was performed and a heterozygous N88S missense mutation in BSCL2 gene was detected in all three patients. CONCLUSION: This report is further confirmation of phenotypic heterogenity due to the N88S mutation of BSCL2 gene in the same family."))

text1

[ ]

processed_text1 = extract(text1)
processed_text1

[ ]

processed_string1 = process_string(processed_text1)
processed_string1

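Here processed_string1 is empty because, for this passage, the LLM extracted 4-tuples rather than triples, and the triple pattern does not match them.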
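
One way to spot this kind of mismatch is to count how many fields each parenthesised group in the raw output contains. The helper below, tuple_arities, is a hypothetical diagnostic and the sample string is made up; it is a sketch for inspection, not part of the original pipeline.

```python
import re
from collections import Counter


def tuple_arities(text: str) -> Counter:
    """Count how many comma-separated fields each parenthesised group contains."""
    groups = re.findall(r'\(([^()]*)\)', text)
    return Counter(len(g.split(',')) for g in groups)


# Made-up output in which the model emitted a 4-tuple next to a proper triple.
sample = "(BSCL2, N88S, LOF, Silver syndrome), (BSCL2, LOF, dHMN-V)"
arities = tuple_arities(sample)
print(arities)
```

A non-zero count for any arity other than 3 (or 2 for the chemical-disease task) indicates output that the strict patterns will drop.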

2.4.2 Task 3

[ ]

text3 = eval(DDI_prompt("Isocarboxazid should be administered with caution to patients receiving Antabuse (disulfiram, Wyeth-Ayerst Laboratories). In a single study, rats given high intraperitoneal doses of an MAO inhibitor plus disulfiram experienced severe toxicity, including convulsions and death. Concomitant use of Isocarboxazid and other psychotropic agents is generally not recommended because of possible potentiating effects. This is especially true in patients who may subject themselves to an overdosage of drugs. If combination therapy is needed, careful consideration should be given to the pharmacology of all agents to be used. The monoamine oxidase inhibitory effects of Isocarboxazid may persist for a substantial period after discontinuation of the drug, and this should be borne in mind when another drug is prescribed following Isocarboxazid. To avoid potentiation, the physician wishing to terminate treatment with Isocarboxazid and begin therapy with another agent should allow for an interval of 10 days."))

text3

[ ]

processed_text3 = extract(text3)
processed_text3

[ ]

processed_string3 = process_string(processed_text3)
processed_string3

2.4.3 Task 2

[ ]

text2 = eval(CDR_prompt("We report the case of a patient who developed acute hepatitis with extensive hepatocellular necrosis, 7 months after the onset of administration of clotiazepam, a thienodiazepine derivative. Clotiazepam withdrawal was followed by prompt recovery. The administration of several benzodiazepines, chemically related to clotiazepam, did not interfere with recovery and did not induce any relapse of hepatitis. This observation shows that clotiazepam can induce acute hepatitis and suggests that there is no cross hepatotoxicity between clotiazepam and several benzodiazepines."))

text2

[ ]

processed_text2 = extract(text2)
processed_text2

[ ]

processed_string2 = process_string_2(processed_text2)
processed_string2

3 Run it!

Write the processed results to the submission.jsonl file, opening it in append mode "a".
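
Because the file is opened in append mode, re-running the cells adds the same records a second time. A minimal sketch of one way to guard against that is to delete any previous output before the first write. The helper names reset_output, append_records, and count_lines are hypothetical, and the demonstration runs in a temporary directory with toy records.

```python
import json
import os
import tempfile


def reset_output(path: str) -> None:
    """Delete a previous output file so append-mode writes start from empty."""
    if os.path.exists(path):
        os.remove(path)


def append_records(path: str, records) -> None:
    """Append records as JSON lines, mimicking mode "a"."""
    with open(path, "a", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")


def count_lines(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "submission.jsonl")
    append_records(out, [{"task": 1}, {"task": 1}])
    append_records(out, [{"task": 1}, {"task": 1}])   # simulated accidental re-run
    lines_after_rerun = count_lines(out)
    reset_output(out)
    append_records(out, [{"task": 1}, {"task": 1}])
    lines_after_reset = count_lines(out)
print(lines_after_rerun, lines_after_reset)
```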

[ ]

task1_write("testA_task_1.jsonl", "submission.jsonl")

[ ]

task2_write("testA_task_2.jsonl", "submission.jsonl")

[ ]

task3_write("testA_task_3.jsonl", "submission.jsonl")

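Note: the results must be merged in this order and with these counts: 25 records for task 1, 224 for task 2, and 50 for task 3. Otherwise scoring will be affected.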
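
The order and counts can be checked mechanically before submitting. The sketch below assumes every record carries an integer "task" field, as the splitting step does; the helper names task_runs and check_submission are hypothetical, and the self-check at the end uses synthetic task ids rather than a real file.

```python
import json
from itertools import groupby

# Order and counts stated in the note above.
EXPECTED = [(1, 25), (2, 224), (3, 50)]


def task_runs(tasks):
    """Collapse a sequence of task ids into (task, run_length) pairs."""
    return [(task, sum(1 for _ in run)) for task, run in groupby(tasks)]


def check_submission(path: str, expected=EXPECTED) -> bool:
    """True if the file holds the expected tasks in order with the expected counts."""
    with open(path, encoding="utf-8") as f:
        tasks = [json.loads(line)["task"] for line in f if line.strip()]
    return task_runs(tasks) == list(expected)


# Self-check on synthetic task ids (no file needed).
good = [1] * 25 + [2] * 224 + [3] * 50
bad = [2] * 224 + [1] * 25 + [3] * 50
print(task_runs(good) == EXPECTED, task_runs(bad) == EXPECTED)
```

Calling check_submission("submission.jsonl") after the three write cells would then flag both a wrong order and duplicated records from an accidental re-run.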

Note: the submitted file must be named "submission.jsonl"!
