Towards an Empirical Evaluation of Scientific Data Indexing and Querying
Keywords:
computational fluid dynamics, dataflow management, scientific data indexing, scientific data querying
Abstract
Computational simulations usually produce large amounts of data at regular time steps. Heterogeneous simulation outputs are stored in different file formats and on distinct storage devices. Therefore, the main challenges in accessing simulation data relate to time-to-query, that is, the effort spent bringing all data into a common framework before a high-level query statement can be issued. Loading simulation data into Database Management Systems (DBMS) is either impractical, as it demands a prohibitive amount of time for data preparation, or infeasible, as the data files are still needed in their original form (scientific applications still need to read and write contents to those files).
In this article, we discuss the complementary approaches of adaptive querying and raw data file indexing for accessing simulation results stored in multiple sources (e.g., raw data files) without data loading. In particular, we review (i) NoDB PostgresRAW routines for adaptive query processing, and (ii) FastBit and FastQuery methods for raw data file indexing and querying.
We examine the behavior of both strategies in a real case study: a computational fluid dynamics simulation in the domain of sediment deposition.
In this experimental evaluation, we measured the elapsed time for index construction and query processing for six distinct query categories over 62 time steps, which amounts to 372 distinct queries on 44,160 files (12.2 GB) produced by the computational simulation. Results show that FastBit/FastQuery is faster than PostgresRAW for query execution in all but low-selectivity query scenarios. On the other hand, results also show that queries in PostgresRAW have a lower accumulated time-to-query than FastBit/FastQuery.
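The two findings above are compatible: a strategy can answer each query faster and still accumulate a larger time-to-query if it pays an up-front preparation cost. The sketch below illustrates this under a simplifying assumption that is ours, not a definition taken from the evaluation: accumulated time-to-query is modeled as preparation time plus the running sum of query times. The function name `accumulated_time_to_query` is hypothetical, and all timings are illustrative placeholders rather than measured results.

```python
from itertools import accumulate


def accumulated_time_to_query(prep_seconds, query_seconds):
    """Running total after each query: preparation cost plus query times so far.

    Assumed model (for illustration only): accumulated time-to-query is
    the one-off preparation time plus the sum of per-query elapsed times.
    """
    totals = accumulate(query_seconds, initial=prep_seconds)
    return list(totals)[1:]  # drop the entry recorded before any query ran


# Placeholder timings, NOT measurements from the case study.
# "indexed": pays an index-construction cost first, then answers queries quickly.
indexed = accumulated_time_to_query(prep_seconds=100.0, query_seconds=[1.0] * 5)
# "adaptive": no preparation step, but each query is slower.
adaptive = accumulated_time_to_query(prep_seconds=0.0, query_seconds=[10.0] * 5)

for n, (t_idx, t_adp) in enumerate(zip(indexed, adaptive), start=1):
    print(f"after query {n}: indexed={t_idx:.1f}s adaptive={t_adp:.1f}s")
```

With these placeholder values the adaptive strategy stays ahead over the first five queries even though each of its queries is ten times slower. Under the same model the ordering would eventually flip as the number of queries grows.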