Hadoop and Spark for Data Management, Processing and Analysis of Astronomical Big Data: Applicability and Performance


Paper:	Hadoop and Spark for Data Management, Processing and Analysis of Astronomical Big Data: Applicability and Performance
Volume:	512, Astronomical Data Analysis Software and Systems XXV
Page:	41
Authors:	Harischandra, L.
Abstract:	The AAT node of the All Sky Virtual Observatory (ASVO) is being built on top of Apache Hadoop and Apache Spark technologies. The Hadoop Distributed File System (HDFS) is used as the data store and Apache Spark is used as the data processing engine. The data store consists of a cluster of four nodes, three of which provide space for data storage, while all four can be used for computation. In this paper, we compare the performance of Apache Spark on GAMA data hosted on HDFS against relational database management systems and other software used for data management, processing and analysis of astronomical Big Data. We examine the usability, flexibility and extensibility of the libraries and languages available within Spark, specifically in querying and processing large amounts of heterogeneous astronomical data. The data included are primarily in tabular format, but we discuss how the rich functionality offered by the Hadoop and Spark libraries can be leveraged to store, process, transform and query data in other formats such as HDF5 and FITS. We also discuss the limitations of existing relational database management systems in terms of scalability and usability. We then evaluate the benchmark results of varying data import and transform scenarios, and the expected latency of queries across a range of complexities. Lastly, we show how astronomers can create custom data-processing tasks in their preferred language (Python, R, etc.) using Spark, with limited knowledge of the Hadoop technologies.