SNP Genotype Calling with Pydoop
Collana seminari interni 2012, Number 20120613 - June 2012
Genotype measurement is a fundamental step in genome-wide association studies (GWAS), which aim to uncover the genetic causes of diseases and other physical traits. The leading technology for measuring genotypes is the SNP microarray, where hundreds of thousands of genetic variants are interrogated simultaneously by carefully designed DNA probes. Genotype calling (GC) -- the act of converting probe intensities to discrete variants -- becomes more accurate as the number of samples processed together increases. However, current GC software scales poorly with input size, which limits the ability to exploit the large number of samples available in large-scale studies.
In this talk, we will show how to use Pydoop -- our Python API for Hadoop -- to develop a distributed GC application that offers more scalability, flexibility and robustness than the current state of the art. The software, currently used in production, can process datasets as large as 7000 samples in a single day -- more than one order of magnitude faster than previous solutions.
BibTeX reference
@InProceedings{Leo12a,
author = {Leo, S.},
title = {SNP Genotype Calling with Pydoop},
booktitle = {Collana seminari interni 2012},
number = {20120613},
month = {june},
year = {2012},
keywords = {bioinformatics,hadoop,python},
url = {https://publications.crs4.it/pubdocs/2012/Leo12a},
}