The CALBC RDF Triple store: retrieval over large literature content

Samuel Croset, Christoph Grabmüller, Chen Li, Silvestras Kavaliauskas, Dietrich Rebholz-Schuhmann
DOI: https://doi.org/10.1038/npre.2011.5383.2
2011-01-01
Nature Precedings
Abstract:
Background: Integrating the scientific literature into a biomedical research infrastructure requires processing the literature, identifying the named entities (NEs) and concepts it contains, and representing the content in a standardised way. Little effort has so far been spent on integrating content from literature text into RDF Triple Stores. The CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus covering four semantic groups by harmonising annotations from automatic text mining solutions; its first version is the Silver Standard Corpus I (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO), and species (SPE). The annotations of the corpus have been transformed into an RDF Triple Store representation so that the content can be queried in combination with bioinformatics data resources (UniProtKB, ArrayExpress) using the RDF query language SPARQL.
Results: All four PPs from the CALBC project contributed annotated data sets for generating the SSC-I. In addition, 12 challenge participants (CPs) provided annotated data sets for evaluation against the SSC-I and for the generation of the SSC-II. The SSC-II contains the following annotations: CHED 238,431, PRGE 435,797, DISO 245,524, and SPE 304,503. The content of the SSC-II has been fully integrated into the RDF Triple Store (4,568,678 triples) and aligned with content from the GeneAtlas (182,840 triples), UniProtKB (12,552,239 triples for human) and the lexical resource LexEBI (BioLexicon). The RDF Triple Store enables querying the scientific literature and bioinformatics resources at the same time for evidence of gene-disease links that involve immunological processes. In total, the CALBC RDF Triple Store makes use of 1,224,255 annotations in the corpus to expose links between entities that are supported by evidence in the text. The RDF Triple Store is implemented as a retrieval engine that allows querying for collocations of named entities and for associated relevant information from the bioinformatics data resources (UniProtKB, ArrayExpress).
Conclusions: The CALBC RDF Triple Store is the first of its kind to expose content extracted from the scientific literature in combination with a large-scale terminological resource, enabling queries for causes of immunological diseases across the most relevant bioinformatics data resources.
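The collocation querying described in the abstract can be illustrated with a minimal sketch. It is not the CALBC implementation: it uses plain Python tuples in place of a real triple store and SPARQL endpoint, and every identifier in it (the `calbc:mentions` and `calbc:semanticGroup` predicates, the `doc:` and `entity:` names, and the `collocations` helper) is a hypothetical placeholder chosen for illustration. Only the semantic group codes PRGE, DISO and SPE come from the abstract.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; every identifier below
# is an illustrative assumption, not taken from the CALBC data.
TRIPLES = [
    ("doc:1", "calbc:mentions", "entity:geneA"),
    ("doc:1", "calbc:mentions", "entity:diseaseX"),
    ("doc:2", "calbc:mentions", "entity:geneA"),
    ("doc:2", "calbc:mentions", "entity:speciesH"),
    ("entity:geneA", "calbc:semanticGroup", "PRGE"),
    ("entity:diseaseX", "calbc:semanticGroup", "DISO"),
    ("entity:speciesH", "calbc:semanticGroup", "SPE"),
]


def collocations(triples, group_a, group_b):
    """Return sorted (document, entity_a, entity_b) tuples for entities of
    the two semantic groups that are mentioned in the same document."""
    # Map each entity to its semantic group.
    group_of = {s: o for s, p, o in triples if p == "calbc:semanticGroup"}

    # Collect the set of entities mentioned in each document.
    mentions = defaultdict(set)
    for s, p, o in triples:
        if p == "calbc:mentions":
            mentions[s].add(o)

    # Pair up co-mentioned entities whose groups match the request.
    hits = []
    for doc, entities in mentions.items():
        for a in entities:
            for b in entities:
                if group_of.get(a) == group_a and group_of.get(b) == group_b:
                    hits.append((doc, a, b))
    return sorted(hits)


gene_disease = collocations(TRIPLES, "PRGE", "DISO")
print(gene_disease)
```

With this toy data, only `doc:1` mentions both a PRGE and a DISO entity, so it is the only document returned for the gene-disease query. In an actual triple store the same join would presumably be expressed as a SPARQL graph pattern over the annotation and resource triples.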