CRS4

Big Data processing with Hadoop

Collana seminari interni 2012, Number 20120418 - April 2012
Download the publication: presentation.pdf [1.4 MB]
In this seminar, we explore the Hadoop MapReduce framework and its use in solving certain types of Big Data problems. These problems, characterized by very large data sets, are becoming more common as data acquisition rates increase in many fields of research and business, driven by the prospect of greater analysis sensitivity. By definition, however, Big Data problems are not tractable with commonly available software and computing systems such as the desktop workstation. They therefore require specialized solutions designed to handle large quantities of data and to scale across large, possibly inexpensive, computing infrastructure. Hadoop provides relatively low-cost access to such solutions by implementing distributed computation and robustness as integral features, so the application developer does not have to reimplement them. In addition to Hadoop's native Java API, a high-level Python API developed here at CRS4 is also available. As a concrete example of a Big Data solution, we briefly look at the Seal suite of distributed tools for processing high-throughput DNA sequencing data, currently used by the CRS4 Sequencing and Genotyping Platform. Finally, we discuss how Hadoop may be applied to your own Big Data problems.
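
To make the MapReduce programming model mentioned above more concrete, here is a minimal sketch in plain Python. It is not the Hadoop Java API and not the CRS4 Python API: it runs the map, shuffle, and reduce steps sequentially in one process, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen purely for illustration. In a real Hadoop job, the developer would supply only the mapper and reducer, and the framework would handle distribution, grouping, and fault tolerance.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Mapper (illustrative): emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key; a framework would do this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer (illustrative): sum the counts collected for one word."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially in a single process."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    lines = ["big data big clusters", "data moves to clusters"]
    print(run_job(lines))
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` depends only on one key's values, both steps could in principle be spread across many machines, which is the property the framework exploits.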

BibTeX references

@InProceedings{Pir12,
  author       = {Pireddu, L.},
  title        = {Big Data processing with Hadoop},
  booktitle    = {Collana seminari interni 2012},
  number       = {20120418},
  month        = {april},
  year         = {2012},
  keywords     = {distributed computing, Hadoop, large data set},
  url          = {https://publications.crs4.it/pubdocs/2012/Pir12},
}
