OptiStore: An On-Demand Data Processing Middleware for Very Large Scale Interactive Visualization
Authors: Zhang, C.
Publication: Submitted as partial fulfillment of the requirements of the degree of Doctor of Philosophy in Computer Science, Graduate College, University of Illinois, Chicago, IL
OptiStore is an on-demand data processing middleware for extremely large scale interactive visualization applications. It aims to provide a data processing service that bridges the gap between the size of very large datasets and the performance of interactive, high-speed parallel visualization applications in the context of OptIPuter. In contrast to the predominant strategy of preprocessing data on the data repository before visualization, OptiStore processes data on demand and interactively. This minimizes the need to manage extraneous preprocessed copies of the data, which will become a major problem as scientists continue to amass vast amounts of data.
In the architecture of OptIPuter, distributed components such as rendering clusters, data storage clusters, and computation clusters are interconnected by wide area optical networks. Hence the data that a visualization cluster demands at one site may be stored at other sites on remote data storage clusters. The goals of OptiStore are to help visualization users access large amounts of data (from terabytes to petabytes) at remote locations, query the data on distributed servers, transfer it among OptIPuter components, and filter and transform it from one data model to another in near real time. Furthermore, OptiStore is an extensible middleware framework into which new data structures and filters can be integrated.
To address the issues of scalability with data size, interactivity in data exploration, and flexibility of data filter deployment, this dissertation proposes the following techniques: load-balancing data partition and organization, multi-resolution analysis, view-dependent data selection, runtime data preprocessing, and dedicated parallel data filtering.
To achieve high overall utilization and reduce latency, we developed a load-balancing data partition and organization mechanism. To scale with the size of the datasets, multi-resolution analysis and view-dependent culling are applied so that only the data needed for the current view of the visualization application is processed. To take advantage of increasing network bandwidth, we decoupled the data filter from the visualization applications and the data repository servers, transferring the bulk of the data through the high-speed network infrastructure. Because the data filter services are separated from the other processes in the distributed visualization pipeline, data providers can maintain and share the data repository with less effort, and users can explore more of the large datasets available on the LambdaGrid and deploy their own filters flexibly. We also developed a novel caching algorithm and a prediction model for prefetching and preprocessing, which minimize data access latency to meet the requirements of interactive visualization.
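As a rough illustration of how view-dependent culling and multi-resolution selection can work together, the following minimal sketch picks, for a 2D tiled dataset, only the blocks that overlap the current view and a resolution level no finer than the screen can display. It is not OptiStore's implementation: the names `Box` and `select_blocks`, the power-of-two level scheme (level 0 is full resolution and each further level halves the samples per axis), and all parameters are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in world coordinates (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the interiors of the two rectangles intersect."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def select_blocks(blocks, view, screen_pixels, block_samples, num_levels):
    """Return (block, level) pairs for the blocks visible in `view`.

    screen_pixels: horizontal pixels the view is rendered into.
    block_samples: samples per axis in a block at level 0 (full resolution).
    num_levels:    number of stored resolution levels (assumed).
    """
    selected = []
    view_width = view.x1 - view.x0
    for block in blocks:
        # View-dependent culling: skip blocks outside the current view.
        if not overlaps(block, view):
            continue
        # Horizontal screen pixels this block would occupy.
        pixels_covered = (block.x1 - block.x0) / view_width * screen_pixels
        # Coarsest level that still supplies about one sample per pixel.
        ratio = block_samples / max(pixels_covered, 1.0)
        level = int(math.floor(math.log2(ratio))) if ratio >= 1.0 else 0
        selected.append((block, min(level, num_levels - 1)))
    return selected


# A 4x4 grid of unit blocks, each holding 1024 samples per axis.
grid = [Box(float(i), float(j), i + 1.0, j + 1.0)
        for i in range(4) for j in range(4)]

# Zoomed out: all 16 blocks are visible, at a coarse level.
overview = select_blocks(grid, Box(0.0, 0.0, 4.0, 4.0), 1024, 1024, 5)

# Zoomed in on one block: a single block at full resolution.
closeup = select_blocks(grid, Box(0.0, 0.0, 1.0, 1.0), 1024, 1024, 5)
```

In this sketch, zooming out increases the number of visible blocks but lowers the resolution requested for each, so the amount of data requested per frame stays roughly bounded by the screen size rather than by the dataset size.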
Date: May 1, 2008