Leveraging stream processing technology for genomics
14th Annual Meeting of the Bioinformatics Italian Society - 2017
* Motivation *
The increasing affordability of next-generation sequencing (NGS) has opened the door to a myriad of applications that were previously technologically or economically unfeasible, such as research into human genetic diseases, oncology, human phylogeny, and personalized diagnostics. The resulting growth in data production rates challenges processing pipelines, which now require scalable computing tools that can keep up with such massive data generation throughput. The raw data produced by NGS must go through several computationally intensive processing steps to extract biologically relevant information. To date, most sequencing centers appear to have opted for processing systems based on conventional software running on High-Performance Computing (HPC) infrastructure: a set of computing nodes accessed through a batch queuing system and equipped with a parallel shared storage system. While with enough effort and equipment this solution can certainly be made to work, it presents issues that need to be addressed. Two important ones are that developers must implement a general way to divide the work of a single job among the computing nodes and that, since the probability of node failure grows with the number of nodes, the system must be made robust to transient and permanent hardware failures, recovering automatically and bringing jobs to successful completion. Even with these measures, however, the architecture of the HPC cluster limits the maximum throughput of the system, because it is usually centered on a single shared storage volume, which tends to become the bottleneck as the number of computing nodes increases. This is especially true for the phases of sequence processing that perform a large amount of I/O relative to computation.
* Methods *
Our novel approach to processing sequencing data departs completely from the status quo: it processes raw sequencing data with Hadoop MapReduce and Apache Flink. To the best of the authors' knowledge, this is the first solution that can process the sequencer's raw data directly on a distributed platform. In brief, in this work we present:
- a complete and scalable Hadoop-based pipeline to align DNA sequences starting from raw data;
- an efficient Flink-based tool to convert from the BCL format to FASTQ (a simplified sketch follows this list);
- the Read Aligner API (RAPI), which encapsulates aligner functionality and provides C and Python bindings;
- improvements to the efficiency of the aligner in the Seal suite.
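To make the BCL-to-FASTQ step concrete, the following is a minimal sketch, not the actual tool, of how such a conversion can be expressed as a Flink batch job. The class name, the toy in-memory input, and the per-read record layout are assumptions made for illustration; real BCL data arrives as one binary file per sequencing cycle, so the actual converter must first aggregate cycles into reads. The sketch does rely on the real BCL byte encoding: each byte packs the base call in its low two bits and the quality score in the high six bits, with a zero byte denoting a no-call.

// Hypothetical, simplified sketch of BCL decoding as a Flink batch job.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class BclToFastqSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Toy input: (read id, BCL bytes for that read). Each BCL byte packs
        // the base in its low 2 bits (A/C/G/T) and the quality in the high
        // 6 bits; a zero byte denotes a no-call (N).
        DataSet<Tuple2<String, byte[]>> reads =
                env.fromElements(Tuple2.of("read1", new byte[]{0x4e, 0x6d, 0x00, 0x77}));

        DataSet<String> fastq = reads.map(
                new MapFunction<Tuple2<String, byte[]>, String>() {
                    @Override
                    public String map(Tuple2<String, byte[]> read) {
                        final char[] bases = {'A', 'C', 'G', 'T'};
                        StringBuilder seq = new StringBuilder();
                        StringBuilder qual = new StringBuilder();
                        for (byte b : read.f1) {
                            if (b == 0) {              // no-call
                                seq.append('N');
                                qual.append('!');      // Phred 0 in Phred+33
                            } else {
                                seq.append(bases[b & 0x03]);
                                qual.append((char) (((b & 0xff) >>> 2) + 33));
                            }
                        }
                        // Assemble the four-line FASTQ record for this read.
                        return "@" + read.f0 + "\n" + seq + "\n+\n" + qual;
                    }
                });

        fastq.print();  // a real job would write FASTQ files to HDFS instead
    }
}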
To evaluate the speed and scalability of our YARN-based sequence processing workflow, we ran it on a real human sequencing dataset with a varying number of nodes, processing the raw BCL data produced by the sequencers, reconstructing the DNA reads, and aligning them to a reference genome. All experiments were run on the Amazon Elastic Compute Cloud (EC2).
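Similarly, the alignment stage of the workflow just described can be pictured, in a much-simplified and hypothetical form, as a Hadoop map task that loads the reference index once during setup and then aligns each read it receives. All names in this skeleton (AlignMapperSketch, the align.reference.index property, the Aligner stub) are invented for illustration and do not reflect the actual Seal or RAPI code.

// Hypothetical skeleton of the alignment stage as a Hadoop mapper.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AlignMapperSketch extends Mapper<LongWritable, Text, Text, NullWritable> {

    /** Illustrative stand-in for a real aligner binding (e.g. through RAPI). */
    static final class Aligner {
        static Aligner withIndex(String indexPath) { return new Aligner(); }
        String align(String read) { return read + "\t(aligned SAM fields)"; }
    }

    private Aligner aligner;

    @Override
    protected void setup(Context ctx) {
        // Load the reference index once per map task, not once per read.
        aligner = Aligner.withIndex(ctx.getConfiguration().get("align.reference.index"));
    }

    @Override
    protected void map(LongWritable offset, Text read, Context ctx)
            throws IOException, InterruptedException {
        // Emit one alignment record (e.g. a SAM line) per input read.
        ctx.write(new Text(aligner.align(read.toString())), NullWritable.get());
    }
}

The key design point the skeleton illustrates is that, by expressing alignment as independent map tasks, work division and recovery from node failures are delegated to the framework rather than implemented by hand, addressing the two HPC issues raised in the Motivation.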
* Results *
Our experiments show that the pipeline has excellent scalability characteristics: a sequencing center could reasonably aim to reduce its processing time per sequencing run to under an hour with a small YARN cluster. Moreover, our solution outperforms the baseline even on a single computing node.
The work we present is an excellent complement to the work currently being done by the GATK team to bring the sequence analysis downstream of alignment to the YARN platform; by combining the two sets of tools, one could build a complete YARN-based pipeline for NGS data, and then further improve performance by keeping intermediate data in memory (for instance with an in-memory data layer such as Apache Arrow), thus removing the need to write it to disk.
The code presented in this work is available as open source at https://github.com/crs4/seal.
* Info *
An extended version of this work was presented at the IEEE International Conference on Big Data, Washington, D.C., USA, 2016.
* BibTeX reference *
@InProceedings{VPZ17,
author = {Versaci, F. and Pireddu, L. and Zanetti, G.},
title = {Leveraging stream processing technology for genomics},
booktitle = {14th Annual Meeting of the Bioinformatics Italian Society},
year = {2017},
keywords = {NGS, genomics, Apache Flink, streaming, Hadoop},
url = {https://publications.crs4.it/pubdocs/2017/VPZ17},
}