Automatic Medical Text Simplification - Challenges of Data Quality and Curation.
Chandrayee Basu, Rosni Vasu, Michihiro Yasunaga, Sohyeong Kim, Qian Yang
2021-01-01
Abstract: Health literacy is the degree to which individuals can comprehend the basic health information needed to make appropriate health decisions. The foremost reason for low health literacy is the vocabulary gap between providers and patients. Automatic medical text simplification can contribute to improving health literacy by assisting providers with patient-friendly communication, improving health data search, and making online medical texts more accessible. It is, however, extremely challenging to curate a quality corpus for this natural language processing (NLP) task. In this position paper, we observe that, despite recent research efforts, existing open corpora for medical text simplification are poor in quality and size. In order to match the progress in general text simplification and style transfer, we must leverage careful crowd-sourcing. We discuss the challenges of naive crowd-sourcing. We propose that careful crowd-sourcing for medical text simplification is possible when combined with automatic data labeling, a well-designed expert-layman collaboration framework, and context-dependent crowd-sourcing instructions.

Low health literacy has been associated with non-adherence to treatment plans and regimens, poor patient self-care, lack of timely communication of health issues, and increased risk of hospitalization and mortality (King 2010). Simplification of medical documents and of online communications such as email messages and patient instructions can go a long way toward mitigating health literacy challenges. While the consumer versions of medical journals, news articles, and a few trusted websites (NIA 2018; Savery et al. 2020) are written by trained experts, they are by no means exhaustive. Automated approaches are necessary to keep pace with the rapidly growing body of biomedical literature. In this work, we evaluate some of the open corpora that power automated text simplification in the medical domain.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

We define text simplification, following Siddharthan (2014), as the process of reducing the linguistic complexity of a text while still retaining the original information content and meaning. A domain-specific expert text undergoes various kinds of transformations to reach the final simple form. Research in automatic non-medical text simplification has been burgeoning, with the introduction of large parallel corpora (Zhu, Bernhard, and Gurevych 2010; Woodsend and Lapata 2011; Coster and Kauchak 2011; Xu, Callison-Burch, and Napoles 2015; Paetzold and Specia 2017). The creation of multiple references enabled models that can learn different kinds of textual transformations separately, viz. lexical changes (e.g., paraphrasing), syntactic modifications (e.g., reordering of concepts, splitting texts, reducing sentence length) and compression (e.g., deleting peripheral information irrelevant to the target domain) (Alva-Manchego et al. 2020). References are gold-standard, human-generated simplifications used to validate model outputs. The success of automatic text simplification and style transfer hinges on large amounts of crowd-sourced multiple references. However, crowd-sourcing even a single set of references for medical texts is challenging. It requires the recruitment of a specific sub-population with a certain degree of domain expertise. For example, Nye et al. (2018) described an elaborate process of recruiting MDs and medical experts from Upwork for PICO data annotation. Consequently, we observe a dearth of high-quality parallel training corpora in medical AI. Furthermore, the text simplification task poses additional challenges.
Only the expert knows what content of the domain-specific text is relevant to laymen, whereas laymen, or medical writers trained to translate medical texts, can judge the quality and accessibility of the simplified versions.

In this work, we make the following contributions:
• identify the open-source datasets for medical text simplification
• characterize the datasets by their quantity, quality, diversity, and representativeness
• identify challenges of scaling high-quality corpus generation for medical text simplification

Assumptions: We treat summarization as a subset of text simplification. For further analysis, we only consider corpora that represent composite textual transformations, in which the simple text is derived through a combination of syntactic, semantic, thematic, and lexical transformations of the expert text (Lyu et al. 2021).

Datasets for Medical Text Simplification

Datasets for medical text simplification support two kinds of document simplification: sentence-level and paragraph-level. We focus on sentence-level and short paragraph-level simplification. After an extensive search, we found three datasets in English for medical text simplification: two parallel corpora, SIMPWIKI (Van den Bercken, Sips, and Lofi 2019) and PARASIMP (Devaraj et al. 2021), and one non-parallel corpus, MSD (Cao et al. 2020). Next, we delve deeper into how these datasets are created and the potential artifacts of the data collection and annotation processes.

Artifacts of Corpus Curation

In the absence of reliable crowd-sourcing of medical texts, researchers resort to crawling medical websites. The expert texts are sampled from the online articles and checked post hoc for adequate corpus representativeness. The layman texts are retrieved from the layman or consumer versions of the professional articles, based on the alignment of section titles and text content.
The alignment is either checked manually for a small fraction of the corpus or derived automatically using different algorithms. Only a few of the automatically aligned pairs are validated by experts. Automatic alignment is not always reasonable (Alva-Manchego, Scarton, and Specia 2020). Random sampling of expert texts from larger articles and unreliable automatic retrieval can lead to text pieces that are not stand-alone (Choi et al. 2021). We found that the process of expert verification is insufficient for quality data curation and could still lead to pairs lacking correspondence. At the other extreme, models trained using highly aligned text pairs may exhibit limited generalizability. A more recent trend is to generate large volumes of non-parallel corpora, obviating the validation of automatically aligned pairs. This follows similar approaches in non-medical text style transfer (Shen et al. 2017; He and McAuley 2016; Madaan et al. 2020). Some researchers distinguish between text simplification and text style transfer tasks. We consider text simplification a sub-domain of text style transfer in which the goal is to transform text from the expert style to the layman style.
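
As a rough illustration of the automatic alignment step discussed above, the following Python sketch pairs each expert sentence with its most lexically similar layman sentence. It is a minimal sketch under stated assumptions, not the procedure used by any of the cited corpora: the Jaccard token-overlap score, the 0.3 threshold, the function names (`tokenize`, `jaccard`, `align`), and the example sentences are all hypothetical choices made for illustration.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between the token sets of two texts (0.0 if either is empty)."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def align(expert_sentences, layman_sentences, threshold=0.3):
    """Pair each expert sentence with its best-scoring layman sentence.

    Pairs whose score falls below `threshold` are dropped. The threshold
    value is an assumption for illustration, not a recommended setting.
    """
    pairs = []
    for expert in expert_sentences:
        if not layman_sentences:
            break
        best = max(layman_sentences, key=lambda s: jaccard(expert, s))
        score = jaccard(expert, best)
        if score >= threshold:
            pairs.append((expert, best, score))
    return pairs


# Hypothetical example sentences, not taken from any of the corpora above.
EXPERT = [
    "Hypertension increases the risk of myocardial infarction.",
    "The drug is contraindicated in renal failure.",
]
LAYMAN = [
    "High blood pressure increases the risk of a heart attack.",
    "Drink plenty of water every day.",
]

if __name__ == "__main__":
    for expert, layman, score in align(EXPERT, LAYMAN):
        print(f"{score:.2f}\t{expert}\t{layman}")
```

Purely lexical scoring of this kind rewards shared vocabulary, so pairs that differ mainly in terminology (for example, "hypertension" versus "high blood pressure") tend to score low. This is one reason automatically aligned pairs may still need expert validation.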