OptiStore: An On-Demand Data Processing Middleware for Very Large Scale Interactive Visualization (Dissertation Defense, C. Zhang)

October 11th, 2007

Categories: MS / PhD Thesis, Networking, Software

About

OptiStore is an on-demand data processing middleware for extremely large scale interactive visualization applications.

It aims to develop a data processing service system that bridges the gap between the size of very large datasets and the performance of interactive, high-speed parallel visualization applications in the context of the OptIPuter.

Compared with the predominant strategy of preprocessing data at the data repository before visualization, OptiStore processes data on demand and interactively, minimizing the need to manage extraneous preprocessed copies of the data, which will become a major problem as scientists continue to amass vast amounts of data.

In the OptIPuter architecture, distributed components such as rendering clusters, data storage clusters and computation clusters are interconnected by wide-area optical networks. Hence the data that a visualization cluster demands at one site may be stored on remote data storage clusters at other sites.

The goal of OptiStore is to help visualization users access large amounts of data (from terabytes to petabytes) at remote locations, query them on distributed servers, transfer them among OptIPuter components, and filter and transform them from one data model to another in near real time.

Furthermore, OptiStore is an extensible middleware framework into which new data structures and filters can be integrated.

To address scalability with data size, interactivity in data exploration, and flexibility of data filter deployment, this dissertation proposes the following techniques: load-balanced data partitioning and organization, multi-resolution analysis, view-dependent data selection, runtime data preprocessing, and dedicated parallel data filtering.

To achieve high overall utilization and reduce latency, we developed a load-balanced data partitioning and organization mechanism.
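As a rough illustration of the idea, and not the dissertation's actual algorithm, the Python sketch below splits a volume into fixed-size blocks and assigns them to server nodes with a greedy least-loaded heuristic. The block size, node count, and the voxel-count load estimate are assumptions for the example.

```python
# Hypothetical sketch of load-balanced block partitioning; block sizes,
# node counts, and the load estimate are illustrative assumptions, not
# the actual OptiStore mechanism.
import heapq
from itertools import product

def partition_volume(dims, block):
    """Split a volume of size dims (x, y, z) into block-aligned sub-volumes."""
    blocks = []
    for ox, oy, oz in product(range(0, dims[0], block[0]),
                              range(0, dims[1], block[1]),
                              range(0, dims[2], block[2])):
        extent = (min(block[0], dims[0] - ox),
                  min(block[1], dims[1] - oy),
                  min(block[2], dims[2] - oz))
        blocks.append({"origin": (ox, oy, oz), "extent": extent})
    return blocks

def assign_blocks(blocks, num_nodes):
    """Greedy least-loaded assignment: each block goes to the node with the
    smallest accumulated load, estimated here by voxel count."""
    heap = [(0, node) for node in range(num_nodes)]   # (load, node id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(num_nodes)}
    # Assign large blocks first so node loads even out.
    for blk in sorted(blocks,
                      key=lambda b: -b["extent"][0] * b["extent"][1] * b["extent"][2]):
        load, node = heapq.heappop(heap)
        assignment[node].append(blk)
        cost = blk["extent"][0] * blk["extent"][1] * blk["extent"][2]
        heapq.heappush(heap, (load + cost, node))
    return assignment

if __name__ == "__main__":
    blocks = partition_volume((512, 512, 512), (128, 128, 128))
    layout = assign_blocks(blocks, num_nodes=8)
    for node, blks in layout.items():
        print(f"node {node}: {len(blks)} blocks")
```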

To ensure scalability with dataset size, multi-resolution analysis and view-dependent culling are applied so that only the data needed for the visualization application's current view is processed.
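A minimal sketch of the view-dependent, multi-resolution selection idea follows, assuming blocks carry origins and extents and that resolution levels are chosen by distance thresholds; the camera model and thresholds are invented for illustration and do not come from OptiStore itself.

```python
# Hypothetical sketch of view-dependent, multi-resolution block selection.
# The block layout, camera model, and level thresholds are assumptions
# for illustration only.
import math

def block_center(blk):
    ox, oy, oz = blk["origin"]
    ex, ey, ez = blk["extent"]
    return (ox + ex / 2.0, oy + ey / 2.0, oz + ez / 2.0)

def select_blocks(blocks, camera_pos, view_radius, level_thresholds):
    """Return (block, level) pairs: cull blocks beyond view_radius and pick a
    coarser level (larger number) for blocks farther from the camera."""
    selected = []
    for blk in blocks:
        dist = math.dist(camera_pos, block_center(blk))
        if dist > view_radius:
            continue                      # outside the region of interest: cull
        level = sum(dist > t for t in level_thresholds)
        selected.append((blk, level))
    return selected

if __name__ == "__main__":
    blocks = [{"origin": (x, 0, 0), "extent": (128, 128, 128)}
              for x in range(0, 1024, 128)]
    picks = select_blocks(blocks, camera_pos=(0, 64, 64),
                          view_radius=600, level_thresholds=[200, 400])
    for blk, level in picks:
        print(blk["origin"], "-> level", level)
```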

To take advantage of increasing network bandwidth, we decoupled the data filters from the visualization applications and the data repository servers, transferring bulk data over the high-speed network infrastructure.
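To illustrate this decoupling, here is a hedged sketch of a standalone filter service that receives a block over TCP and returns filtered bytes. The wire format (a 4-byte length prefix plus raw payload), the port, and the toy subsampling filter are assumptions for the example, not OptiStore's actual protocol.

```python
# Hypothetical sketch of a standalone data-filter service. The wire format,
# port number, and subsampling filter are illustrative assumptions, not
# OptiStore's actual protocol.
import socket
import struct

def subsample(block_bytes, step=2):
    """Toy filter: keep every `step`-th byte of the block payload."""
    return block_bytes[::step]

def serve(host="0.0.0.0", port=9000):
    with socket.create_server((host, port)) as srv:
        print(f"filter service listening on {host}:{port}")
        while True:
            conn, _addr = srv.accept()
            with conn:
                # Read a 4-byte big-endian length prefix, then the block payload.
                header = conn.recv(4)
                if len(header) < 4:
                    continue
                (length,) = struct.unpack("!I", header)
                payload = b""
                while len(payload) < length:
                    chunk = conn.recv(min(65536, length - len(payload)))
                    if not chunk:
                        break
                    payload += chunk
                result = subsample(payload)
                conn.sendall(struct.pack("!I", len(result)) + result)

if __name__ == "__main__":
    serve()
```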

By separating the data filter services from the other processes in the distributed visualization pipeline, data providers can maintain and share the data repository with less effort, and users can explore more of the large datasets available on the LambdaGrid and flexibly deploy their own filters.

We developed a novel caching algorithm and a prediction model for prefetching and preprocessing that minimize data access latency and meet the requirements of interactive visualization.
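The specifics of the caching algorithm and prediction model are not reproduced here; as a hedged illustration of the general idea only, the sketch below pairs an LRU block cache with a naive predictor that prefetches the block one step ahead along the viewer's recent motion. The LRU policy, cache capacity, and one-step-ahead predictor are assumptions, not the dissertation's actual design.

```python
# Hypothetical sketch of caching plus motion-based prefetching. The LRU
# policy, cache capacity, and one-step-ahead predictor are illustrative
# assumptions, not the dissertation's actual algorithm or model.
from collections import OrderedDict

class BlockCache:
    """Simple LRU cache mapping block indices (i, j, k) to block data."""
    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # function that fetches a block by index
        self.cache = OrderedDict()

    def get(self, index):
        if index in self.cache:
            self.cache.move_to_end(index)      # mark as most recently used
            return self.cache[index]
        data = self.loader(index)              # cache miss: fetch remotely
        self.cache[index] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return data

    def prefetch(self, index):
        if index not in self.cache:
            self.get(index)

def predict_next(prev_index, cur_index):
    """Naive predictor: assume the viewer keeps moving in the same direction."""
    return tuple(c + (c - p) for p, c in zip(prev_index, cur_index))

if __name__ == "__main__":
    cache = BlockCache(capacity=64, loader=lambda idx: f"block{idx}")
    history = [(0, 0, 0), (1, 0, 0)]
    cache.get(history[-1])
    cache.prefetch(predict_next(history[-2], history[-1]))  # prefetch (2, 0, 0)
```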