August 15th, 2014
In this thesis, I present a novel genome data visualization targeting an important area of genomics research: comparative bacterial gene neighborhood analysis. This type of analysis demands scalable visualization designs that accommodate the simultaneous comparison of hundreds of genomes.
Decreases in genome sequencing costs have driven a proliferation of genomic data. Thousands of complete genome sequences have been compiled in public databases, and even larger volumes of data have been generated privately by independent research groups. Bacterial genome sequencing in particular has accelerated. The comparative analysis of these genomes provides a new approach to the study of novel proteins and protein interactions. While automated analysis plays a significant role in the generation and analysis of this data, visualization is needed to bring experts into the data-mining loop, to verify the results of automated analysis, and to detect patterns that are difficult to find through computation alone.
At the same time, advances in display hardware have enabled similarly rapid growth in display resolution and the development of large, high-resolution environments. These environments present an opportunity to visualize big data in new ways and better integrate expert judgment in the computational analysis of big data. Recent research suggests that big displays present benefits to visualization designers and enable the analysis of complex data sets. However, more research is needed to understand the design decisions, such as perceptually scalable design, that best take advantage of these benefits.
Genomic data visualizations have largely failed to keep pace with this growth in genomic data generation and display resolution. Existing visualizations are not designed to enable the comparison of more than a few genomes at once, and are built to work on low- to moderate-resolution environments. While it might seem reasonable to simply 'scale up' these visualizations to fill the available screen space, in many cases their designs do not work on big displays.
In this thesis, I present the Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a new comparative gene neighborhood visualization designed to address large volumes of bacterial genome sequences and to explore the design decisions that best take advantage of large, high-resolution environments. I will describe the design of this approach and will characterize the ways in which this design scales to large numbers of comparisons, is suited to high-resolution environments, and adopts perceptually scalable encodings. My high-density genome data visualization approach relies on interactive visual queries to transform large data volumes into high-resolution comparative genomic maps. These maps use pre-attentive visual cues to address analytic tasks for comparative gene neighborhood investigations across large volumes of complete bacterial genomes. In addition, I present a visual algorithm that transforms the data into a view that provides detail-up-close and context-from-a-distance, allowing researchers to access data at these two scales simultaneously.
The implementation of this approach is an interactive application that can run on single-machine tiled-display walls as well as high-resolution personal displays. I will describe the program design and architecture, along with several examples from visualizing draft Escherichia coli (E. coli) genomes.
Preliminary results of this work suggest the novelty and significance of this approach, as well as potential areas of extension. This approach adopts encodings that scale more effectively to large displays and large data volumes, enabling the rapid performance of analytic tasks across large data volumes. This approach also enables the simultaneous comparison and analysis of hundreds to thousands of genomic sequences, which greatly exceeds the volumes possible in any existing tool.
Future work in this area includes generalizing the approach to other sub-fields in comparative genomics. In addition, I will adapt this approach to fit within an ecosystem of multiple coordinated visualizations for tiled-display walls. I will also explore parallelization by adapting the rendering for the graphics processing unit, to achieve better interactivity.
The primary contributions of this thesis are as follows:
1) BactoGeNIE is a novel visualization design that is the first scalable visualization for comparative analysis of hundreds to thousands of gene neighborhoods. State-of-the-art competitor visualizations handle no more than 9 gene neighborhoods.
2) BactoGeNIE is the first interactive 'thousand-genome' comparative visual approach on the gene neighborhood scale, combining navigation, details-on-demand, contig sorting, contig density control, 'zoom', and the application of color tags to genes of interest.
3) BactoGeNIE is the first to employ a dynamic 'gene targeting' interaction, which combines on-the-fly alignment and sorting by user-selected ortholog clusters with the application of a color ramp for the target gene and contig, allowing for rapid hypothesis testing and the pre-attentive identification of commonly recurring neighbors, deletions, insertions, inversions, truncations, and potential errors in data processing.
4) BactoGeNIE is the first to unify overviews with gene neighborhood details that can be accessed through physical movement.
5) BactoGeNIE is the first gene neighborhood comparative visualization to be implemented for large, high-resolution environments.
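The 'gene targeting' interaction in contribution 3 can be illustrated with a minimal sketch. This is not BactoGeNIE's implementation: the `Gene` record, the functions `align_to_target` and `color_ramp`, the rule that contigs containing the target are listed first, and the rule that ramp colors are assigned by gene-order distance from the target are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    start: int    # start coordinate on the contig, in base pairs
    end: int      # end coordinate on the contig, in base pairs
    cluster: str  # ortholog cluster identifier (hypothetical naming)


def align_to_target(contigs, cluster):
    """Align and sort contigs around a user-selected ortholog cluster.

    `contigs` maps a contig name to its genes in positional order.
    Returns (name, offset, genes) tuples. Adding `offset` to every
    coordinate on a contig places its target gene at x = 0. Contigs that
    lack the target get offset None and are listed last (an assumed rule).
    """
    hits, misses = [], []
    for name, genes in contigs.items():
        target = next((g for g in genes if g.cluster == cluster), None)
        if target is None:
            misses.append((name, None, genes))
        else:
            hits.append((name, -target.start, genes))
    return hits + misses


def color_ramp(reference, cluster, palette):
    """Assign palette colors to the clusters on a reference contig.

    Clusters nearer to the target gene (in gene order) get earlier palette
    entries; distances beyond the palette reuse its last entry. Orthologs
    on other contigs can then be drawn with the same cluster color.
    """
    indices = [i for i, g in enumerate(reference) if g.cluster == cluster]
    if not indices:
        return {}
    t = indices[0]
    colors = {}
    # Visit genes from nearest to farthest so the closest copy of a
    # repeated cluster determines that cluster's color.
    for i in sorted(range(len(reference)), key=lambda i: abs(i - t)):
        rank = min(abs(i - t), len(palette) - 1)
        colors.setdefault(reference[i].cluster, palette[rank])
    return colors
```

With a shared cluster-to-color mapping, a neighbor that recurs across contigs appears as a column of one color after alignment, while a missing or displaced gene breaks that column.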
Aurisano, J., Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays, Thesis submitted as partial fulfillment of the requirements for the degree of Master of Science in Computer Science, Graduate College of the University of Illinois at Chicago, August 15th, 2014.