		| Paper: | 
		Towards Large-scale RoI Indexing for Content-aware Data Discovery | 
	 
	
		| Volume: | 
		522, Astronomical Data Analysis Software and Systems XXVII | 
	 
	
		| Page: | 
		57 | 
	 
	
		| Authors: | 
		Araya, M.; Caceres, R.; Gutierrez, L.; Mendoza, M.; Ponce, C.; Valenzuela, C. | 
	 
	
	
		| Abstract: | 
		Data discovery within large archives is a key issue for modern astronomy:
 multi-source, multi-wavelength, multi-instrument and large-scale verifications
 need proper data discovery tools for filtering the very large datasets of
 observations available nowadays.  The Virtual Observatory and file format
 standards have helped enable data discovery at the metadata level, where
 filtering is limited to what was explicitly annotated at the
 observation, calibration or data reduction stages. The next step is to perform
 data discovery at the content level, where content descriptors are automatically
 gathered from the observations to perform content-aware search. In a very
 general sense, this corresponds to automatically generating catalogs from large
 and diverse datasets. In this work, we consider the public spectroscopic data
 products from ALMA (FITS cubes), and we apply the fast Region of Interest Seek
 and Extraction algorithm (RoiSE) to obtain content descriptors of the spatial
 forms, positions, intensities and wavelengths of the source emissions. Despite
 the efficiency of the algorithm, it is impractical to process all the data in a
 batch/sequential manner. The problem, then, was to decide which tools and
 architecture to use for distributing tasks across the datacenter. Among the
 several distributed/parallel computing alternatives, we selected the Dask
 packages to build the distributed pipeline that we outline in this paper, mainly
 because the current RoiSE implementation is written in Python. The main
 challenge of this pipeline is the diversity of data products: different
 resolutions, signal-to-noise ratios, densities, morphologies, imaging
 parameters, etc. Therefore, we include an adaptive parameter tuning mechanism to
 cope with this diversity. Finally, we present an example of content-aware data
 discovery over the obtained database. | 
	 
	