CRS4

Flink in genomics: Integrating Flink and Kafka in the standard genomic pipeline

Francesco Versaci, Luca Pireddu, Gianluigi Zanetti
Misc - September 2017
DNA high-throughput sequencing is a key process that enables dozens of important applications, from oncology to personalized diagnostics. Following up on last year's presentation, we will show how to integrate Flink and Kafka into the standard pipeline for genomic processing. After the raw data from the sequencers are read, they must be fed to an aligner in order to reconstruct the whole genome from the available short reads. To this end we have extended the standard BWA-MEM aligner, providing a Java API that can be called from Flink. Our Flink-based genomic processor consists of distinct specialized modules, loosely coupled via Kafka streams, to allow easy composition and integration with the existing Hadoop-based pipelines. We will discuss how to dynamically generate and consume Kafka topics within Flink, focusing in particular on how to transfer finite streams, since we want to write these streams out using existing Hadoop output formats.

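The talk's own code is not reproduced here; the sketch below only illustrates the kind of wiring described in the abstract, assuming a 2017-era Flink 1.3 setup with the flink-connector-kafka-0.10 and flink-hadoop-compatibility modules on the classpath. The topic name, broker address, group id, and output path are hypothetical placeholders, and the issue the talk actually focuses on, detecting when a finite stream carried over Kafka ends so that the Hadoop output format can be finalized, is not addressed.

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class KafkaToHadoopSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Kafka consumer configuration (placeholder broker and group id).
    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092");
    props.setProperty("group.id", "flink-genomics-sketch");

    // Consume one topic; in the pipeline described above the topic name would be
    // generated dynamically by the upstream module.
    FlinkKafkaConsumer010<String> consumer =
        new FlinkKafkaConsumer010<>("aligned-reads", new SimpleStringSchema(), props);
    DataStream<String> reads = env.addSource(consumer);

    // Wrap an existing Hadoop output format so Hadoop-based tooling can read the result.
    Job job = Job.getInstance();
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/aligned-reads-out"));
    HadoopOutputFormat<Text, Text> hadoopFormat =
        new HadoopOutputFormat<>(new TextOutputFormat<Text, Text>(), job);

    // Convert each record to the (key, value) pair expected by the Hadoop format.
    reads
        .map(new MapFunction<String, Tuple2<Text, Text>>() {
          @Override
          public Tuple2<Text, Text> map(String record) {
            return Tuple2.of(new Text(), new Text(record));
          }
        })
        .writeUsingOutputFormat(hadoopFormat);
    // Note: the Hadoop format is only finalized when the job terminates, which is why
    // the talk concentrates on transferring finite streams over Kafka.

    env.execute("Kafka topic to Hadoop output format (sketch)");
  }
}
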
BibTeX reference

@Misc{VPZ17a,
  author       = {Versaci, F. and Pireddu, L. and Zanetti, G.},
  title        = {Flink in genomics: Integrating Flink and Kafka in the standard genomic pipeline},
  month        = {September},
  year         = {2017},
  type         = {Presentation at the FlinkForward 2017 workshop},
  keywords     = {Big data, Apache Flink, Apache Kafka, Genomics, NGS},
  url          = {https://publications.crs4.it/pubdocs/2017/VPZ17a},
}
