CRS4

Scaling deep learning data management with Cassandra DB

Francesco Versaci, Giovanni Busonera
2021 IEEE International Conference on Big Data (Big Data) - December 2021
Download the publication: cassandra-ml.pdf [935 KB]
Deep learning (DL) algorithms need to harvest an increasingly large amount of data to be fully effective. These data, typically organized as millions of small files, stress filesystems and are difficult to manage. In fact, despite the huge development of DL tools and specialized hardware, the data loading pipeline for DL still lags behind in ease of use, standardization, and scalability. In this work we try to rethink the data loading pipeline by leveraging NoSQL DBs to store both data and metadata, making them efficiently available through the network and allowing easier data distribution for parallel DL training. We present our open-source, Apache Cassandra-based data loader and illustrate its use and performance, which enable easy and efficient data management and decentralized data distribution for parallel learning applications.

BibTeX reference

@InProceedings{VB21,
  author       = {Versaci, F. and Busonera, G.},
  title        = {Scaling deep learning data management with Cassandra DB},
  booktitle    = {2021 IEEE International Conference on Big Data (Big Data)},
  series       = {International Conference on Big Data (Big Data)},
  month        = {december},
  year         = {2021},
  publisher    = {IEEE},
  keywords     = {Deep learning, Data management, NoSQL DB},
  doi          = {10.1109/BigData52589.2021.9672005},
  isbn         = {978-1-6654-3902-2},
  url          = {https://publications.crs4.it/pubdocs/2021/VB21},
}
