Data alchemy, from lab to insight: Transforming in vivo experiments into data science gold
Troy J. Kieran, Taronna R. Maines, Jessica A. Belser
DOI: https://doi.org/10.1371/journal.ppat.1012460
IF: 7.464
2024-09-02
PLoS Pathogens
Abstract: When conducting an in vivo experiment, researchers will typically collect a diverse array of qualitative observations and quantitative measurements. Therefore, choosing data that are most relevant for aggregation and tidying is a crucial first step (Fig 1). Care must be taken when combining data from multiple studies to determine which data points are most consistently collected between experiments (especially if different research staff are conducting the work). This may require excluding from analysis specific parameters (like lethargy or animal activity level) that may be more vulnerable to laboratorian bias, depending on the specific standardized assessment employed. To reduce experimental confounders, studies intended for aggregation should be conducted under conditions that are as uniform or standardized as possible (with these inclusion criteria explicitly stated within the analysis) [1–3]. As any research scientist will attest, in vivo-generated data are highly heterogeneous, particularly when using outbred species. Variability may be present in baseline (pre-inoculation) animal age, weight, temperature, activity level, blood chemistry, and innate immune response parameters, among others. Inoculation (e.g., infectious dose) and post-inoculation (e.g., specimen collection) variability can also be present. As most studies assessing viral pathogenicity report changes relative to baseline, normalizing raw data to reflect a linear or percentage-based deviation from baseline will typically yield aggregate data with lower standard error and greater uniformity, and represents a best practice in the field [4]. Normalization can typically occur before or after aggregation. It is frequently desirable to contextualize in vivo-derived outcomes with genotypic data [5–8]; however, these data must be similarly curated before further analysis, with reliable consensus sequence data available for aggregation and use (Fig 2).
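The baseline normalization described above can be illustrated with a minimal sketch. The animal identifiers, day labels, and weight values are hypothetical toy data, and the function names `percent_of_baseline` and `linear_deviation` are assumptions introduced here for illustration, not part of the original work.

```python
from statistics import mean

# Hypothetical raw body weights in grams; day 0 is the pre-inoculation baseline.
raw_weights = {
    "animal_01": {0: 1000.0, 2: 950.0, 4: 900.0},
    "animal_02": {0: 800.0, 2: 780.0, 4: 760.0},
}


def percent_of_baseline(series, baseline_day=0):
    """Express each measurement as a percentage change from the baseline value."""
    baseline = series[baseline_day]
    if baseline == 0:
        raise ValueError("baseline must be non-zero for percentage normalization")
    return {day: 100.0 * (value - baseline) / baseline for day, value in series.items()}


def linear_deviation(series, baseline_day=0):
    """Express each measurement as an absolute difference from the baseline value."""
    baseline = series[baseline_day]
    return {day: value - baseline for day, value in series.items()}


# Normalize each animal against its own baseline before aggregating.
normalized = {animal: percent_of_baseline(s) for animal, s in raw_weights.items()}

# Aggregate after normalization: mean percent change per day across animals.
days = sorted(next(iter(raw_weights.values())))
mean_change = {day: mean(normalized[a][day] for a in normalized) for day in days}

for day in days:
    print(f"day {day}: mean change from baseline = {mean_change[day]:.2f}%")
```

In this sketch each animal is normalized to its own baseline before averaging, so animals with different starting weights contribute comparable values. A linear deviation (as in `linear_deviation`) may be the more natural choice for measurements such as temperature.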
Will full-length genetic sequences be assessed, or will specific molecular residues that are known to affect the tested variable be sufficient [9]? Molecular residues are often compensatory in nature; will researchers build new data set columns with anticipated phenotypic outcomes from constellations of specific amino acids at key positions (like predicted receptor binding preference or length of an accessory protein)? If laboratory-generated data will be included, have researchers ensured reproducibility of aggregated experiments performed over time [10], with oversight for potential dual-use research of concern? Considering the scope of information that can be obtained from in vivo, in vitro, and molecular analyses, selecting input data for subsequent processing represents a challenging endeavor.
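One way to picture the derived data set columns mentioned above is the following sketch. Everything in it is a placeholder: the sequences are toy strings, the key positions and the constellation-to-phenotype mapping are hypothetical and not validated biological rules, and names such as `build_features` are assumptions made for illustration.

```python
# Hypothetical consensus protein sequences keyed by virus name (toy data only).
sequences = {
    "virus_A": {"surface": "MKTQLG", "accessory": "MEQ"},
    "virus_B": {"surface": "MKTLLS", "accessory": "MEQGTDLP"},
}

# Hypothetical 1-based key positions in the surface protein.
KEY_POSITIONS = (4, 6)

# Placeholder rule mapping a constellation of residues to a predicted phenotype.
# A real analysis would take such rules from curated literature.
CONSTELLATION_TO_PHENOTYPE = {
    ("Q", "G"): "preference_1",
    ("L", "S"): "preference_2",
}


def residues_at(seq, positions):
    """Return the residues at the given 1-based positions ('-' if out of range)."""
    return tuple(seq[p - 1] if p <= len(seq) else "-" for p in positions)


def build_features(record):
    """Derive new data set columns from the sequences of a single virus."""
    constellation = residues_at(record["surface"], KEY_POSITIONS)
    return {
        "residues": "".join(constellation),
        "predicted_binding": CONSTELLATION_TO_PHENOTYPE.get(constellation, "unknown"),
        "accessory_length": len(record["accessory"]),
    }


feature_table = {name: build_features(rec) for name, rec in sequences.items()}

for name, features in feature_table.items():
    print(name, features)
```

Treating the residues as a combined constellation, rather than one column per position, reflects the point that residues are often compensatory. Unrecognized combinations fall back to "unknown" instead of a guessed label.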
microbiology, virology, parasitology