Distributed stream processing for genomics pipelines
Misc - October 2017
Personalized medicine is enabled in great part by progress in data acquisition technologies for modern biology, such as next-generation sequencing (NGS). Conventional NGS processing workflows are composed of independent tools that implement shared-memory parallelism and communicate through intermediate files. With increasing data sizes, this approach shows limited scalability and robustness, problems that make it unsuitable for large-scale, population-wide personalized medicine applications. In this work we propose adopting a stream computing architecture to make the genomics pipeline more scalable and fault-tolerant. We implemented the first processing phases for Illumina sequencing data, from raw data to alignment, using the Apache Flink distributed stream processing framework and Apache Kafka. The new pipeline was tested by processing the raw output of an Illumina HiSeq3000 sequencer and producing aligned reads in CRAM format. The results show near-optimal scalability in experiments from 1 to 12 computing nodes, with a speed-up of 9.5x over the conventional solution (which cannot automatically run on multiple nodes). This result is particularly positive considering that the very short runtime of the experiment, less than 15 minutes, makes the constant-time overheads imposed by the frameworks significant.
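The core idea of the abstract, replacing file-based hand-offs between pipeline tools with streaming stages that communicate through message channels, can be illustrated with a toy sketch. This is not the authors' Flink/Kafka implementation; the stage names, record formats, and queue-based channels (standing in for Kafka topics) are illustrative assumptions only.

```python
# Toy sketch of a streaming genomics pipeline: independent stages pass
# records through in-memory queues (a stand-in for Kafka topics) instead
# of writing intermediate files. All names and formats are hypothetical.
import queue
import threading

def produce_raw_reads(out_q, n_reads):
    """Emit raw sequencer records (here: toy strings) as they arrive."""
    for i in range(n_reads):
        out_q.put(f"read-{i}:ACGT")
    out_q.put(None)  # end-of-stream marker

def align_reads(in_q, out_q):
    """Consume raw reads and emit 'aligned' records downstream."""
    while (rec := in_q.get()) is not None:
        read_id, seq = rec.split(":")
        out_q.put((read_id, seq, "chr1"))  # placeholder alignment result
    out_q.put(None)

def run_pipeline(n_reads):
    """Wire the stages together and collect the aligned records."""
    raw_q, aligned_q = queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=produce_raw_reads, args=(raw_q, n_reads)),
        threading.Thread(target=align_reads, args=(raw_q, aligned_q)),
    ]
    for t in stages:
        t.start()
    results = []
    while (rec := aligned_q.get()) is not None:
        results.append(rec)
    for t in stages:
        t.join()
    return results

print(len(run_pipeline(5)))  # → 5
```

In the actual system, each stage would be a Flink operator scaled across nodes, with Kafka providing durable, replayable channels between them, which is what gives the pipeline its fault tolerance.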
BibTeX reference
@Misc{VPZ17b,
author = {Versaci, F. and Pireddu, L. and Zanetti, G.},
title = {Distributed stream processing for genomics pipelines},
month = {October},
year = {2017},
note = {https://doi.org/10.7287/peerj.preprints.3338v1},
type = {poster},
keywords = {big data, ngs, stream processing, flink, kafka},
url = {https://publications.crs4.it/pubdocs/2017/VPZ17b},
}