Acceleration and automation of genomic data analysis to meet corporate compliance standards using advanced cloud components.

Mr. Satyoki Chatterjee, Sanjay Pankaj Choudhary, Dr. Gopal Joshi, Koshatwar, Satyoki Chatterjee, Circulant, Sanjay Koshatwar, Shekhar Seera
Abstract: Recent advances in high-throughput next-generation sequencing (NGS) technologies have driven exponential growth in genomic research, revolutionizing biological data analysis and enabling the study of complex biological systems at an unprecedented scale. A key limitation of NGS is the deluge of genomic data it produces: a single workstation running the analysis steps sequentially cannot deliver results quickly. Manual intervention reduces efficiency further. To mitigate these problems, we developed an in-house pipeline that automates RNA-seq data analysis using AWS services and tools such as Snakemake and kallisto. The pipeline is efficient, scalable, reproducible, version-controlled, transparent, and cost-effective for large volumes of data. In this study, we review the use of RNA sequencing on AWS to analyze gene expression at the transcriptional level. In this systematic approach, contract research organizations (CROs) transfer raw data through an SFTP server; the data undergo quality validation and are then transferred automatically to Simple Storage Service (S3). Helper scripts then transfer the data from S3 to Elastic File System (EFS), launch the FASTQ processing pipeline, clone the GitHub repository of the corresponding project, and leverage AWS Batch to spin up dynamic Elastic Compute Cloud (EC2) instances as needed. After successful execution, the outputs are available in EFS, the actual data analysis is performed in RStudio Workbench, and the results are archived automatically in S3.
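As a rough illustration of the helper-script step, the sketch below builds the request that such a script might send to AWS Batch to stage FASTQ files from S3 to EFS and start Snakemake. It is not the authors' code. The class `RnaSeqRun`, the function `build_batch_job`, the bucket, prefix, queue, and job-definition names, the `/mnt/efs` mount point, and the `PROJECT` environment variable are all assumptions made for the example. The repository clone step is omitted because the abstract does not give a repository location.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RnaSeqRun:
    """Hypothetical description of one RNA-seq delivery (illustrative fields only)."""
    project: str                 # project name, used to name the job and work directory
    bucket: str                  # assumed S3 bucket holding validated raw FASTQ files
    prefix: str                  # assumed S3 key prefix for this delivery
    efs_root: str = "/mnt/efs"   # assumed EFS mount point inside the job container


def build_batch_job(run: RnaSeqRun, job_queue: str, job_definition: str) -> dict:
    """Build keyword arguments in the shape of an AWS Batch SubmitJob request."""
    workdir = f"{run.efs_root}/{run.project}"
    # Copy FASTQ files from S3 onto EFS, then let Snakemake run the workflow
    # found in the project's working directory.
    command = [
        "bash",
        "-c",
        f"aws s3 sync s3://{run.bucket}/{run.prefix} {workdir}/fastq "
        f"&& snakemake --cores all --directory {workdir}",
    ]
    return {
        "jobName": f"rnaseq-{run.project}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "command": command,
            "environment": [{"name": "PROJECT", "value": run.project}],
        },
    }


# Example values are placeholders, not names taken from the paper.
run = RnaSeqRun(project="demo", bucket="example-raw-data", prefix="cro/batch01")
job = build_batch_job(run, job_queue="example-queue", job_definition="example-rnaseq:1")
print(job["jobName"])
print(job["containerOverrides"]["command"][2])
```

With boto3 installed and credentials configured, the resulting dictionary could be submitted with `boto3.client("batch").submit_job(**job)`. Building the request separately from the call keeps the sketch runnable and testable without an AWS account.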
Computer Science