SNP genotype calling with MapReduce

Simone Leo, Luca Pireddu, Gianluigi Zanetti
Proceedings of the third international workshop on MapReduce and its Applications, pages 49--56, 2012
Download the publication: map406-leo.pdf [456 KB]
Genotype measurement is a key step in genome-wide association studies -- those studies that aim to uncover the underlying genetic causes of physical traits, including disease. The leading technology for measuring genotypes is the SNP microarray, where hundreds of thousands of genetic variants are interrogated simultaneously. For some of the most commonly used high-throughput genotyping technologies, the conversion from raw measured data to genotype calls (i.e., identifying the specific genomic variants) requires the concurrent analysis of many samples, with the quality of the results crucially depending on the size of the batch. However, current software for microarray analysis scales poorly with input batch size. In large-scale studies, this limits the ability to harness the large number of samples available to improve the accuracy of genotype calling. Here, we present a MapReduce application that offers both greater scalability and greater flexibility than the current state of the art. The software can process datasets as large as 7000 samples in a day, is more than one order of magnitude faster than previous solutions, and is currently used in production.

BibTeX reference

  author       = {Leo, S. and Pireddu, L. and Zanetti, G.},
  title        = {SNP genotype calling with MapReduce},
  booktitle    = {Proceedings of the third international workshop on MapReduce and its Applications},
  series       = {MapReduce '12},
  pages        = {49--56},
  year         = {2012},
  publisher    = {ACM},
  keywords     = {genotyping,mapreduce},
  doi          = {10.1145/2287016.2287026},
  isbn         = {978-1-4503-1343-8},
  url          = {},
