Joseph A. Insley, Daniel J. Sandin, and Thomas A. DeFanti
To acquire these images the person stands on a turntable in front of a blue screen, ten feet from a video camera mounted at eye level. We preview the images and set the levels of the various parameters used in calculating the chroma key that drops out the background. Once these values are set, we record one full revolution of the person on the turntable and store the chroma key settings with it. Recording at 30 frames per second for the 15 seconds required for one revolution yields a movie file of 450 frames. A configuration file can later specify the maximum number of images to use, as well as other pertinent information. This allows individual users to adjust parameters to the specific hardware and memory capacity of the machine they are running on.
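The format of this configuration file is not given in the text; a hypothetical sketch, with all parameter names assumed, might look like:

```
# Hypothetical VideoAvatar configuration (parameter names assumed)
movie_file    avatar.mov   # recorded turntable sequence (450 frames)
max_images    90           # subsample the frames to fit texture memory
image_width   256          # texture dimensions for each frame
image_height  256
```

Lowering `max_images` trades angular resolution of the avatar for reduced memory use on smaller machines.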
From this series of images, a different image is selected to represent the person to each eye. The two images to use are chosen based on the positions of both the local and remote users in the space and on the direction the remote user is facing. These images are then texture mapped onto two separate polygons, each of which is rotated around the vertical axis toward the eye that is meant to see it.
As we are not actually creating a three-dimensional model, but using two two-dimensional images to represent the avatars, not all depth cues are supported. Among those supported are convergence, binocular disparity, horizontal motion parallax, and occlusion. The proper perspective of the VideoAvatar is maintained in relation to the other objects in the modeled environment.
However, this does not hold true for vertical motion parallax, since the images were captured from a fixed distance and at a fixed height. For the same reason, the avatar's own perspective within the model is only approximate: it is exactly correct only when the avatar is viewed from the same distance and height used in recording.
One of the more attractive features of the VideoAvatar Library is its ease of use. It utilizes the networking services of the CAVE Library, which provides the position and orientation of the head and hand of all networked users. This allows VideoAvatars to be added to existing CAVE programs by simply inserting a few function calls. They can also be used to represent other objects or agents whose position and orientation are supplied to the library by the programmer.
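The library's actual API is not shown in the text; the sketch below (all type and function names hypothetical, with image loading and rendering stubbed out) illustrates the "few function calls" pattern, where each avatar's pose comes either from the CAVE Library's networked tracker data or directly from the programmer:

```c
#include <stdio.h>

/* Hypothetical stand-ins for the VideoAvatar Library entry points;
 * the real names and signatures are not given in the text. */
typedef struct { float x, y, z, azimuth; } Pose;  /* position + heading */

#define MAX_AVATARS 8
static int  g_avatar_count = 0;
static Pose g_avatar_pose[MAX_AVATARS];

/* Once at startup: load a recorded turntable sequence described by
 * a configuration file; returns an avatar handle. */
static int avatar_add(const char *config_file) {
    (void)config_file;                /* image loading omitted */
    return g_avatar_count++;
}

/* Per frame: feed each avatar its tracked (or programmer-supplied)
 * position and orientation. */
static void avatar_set_pose(int id, Pose p) { g_avatar_pose[id] = p; }

/* Per frame: select the per-eye images and draw the two textured
 * polygons (selection and rendering omitted in this sketch). */
static void avatar_draw_all(void) {
    for (int i = 0; i < g_avatar_count; i++)
        printf("avatar %d at (%.1f, %.1f, %.1f)\n", i,
               g_avatar_pose[i].x, g_avatar_pose[i].y, g_avatar_pose[i].z);
}
```

An existing application would call `avatar_add()` once at startup, then `avatar_set_pose()` and `avatar_draw_all()` in its display loop, consistent with the claim that only a few inserted calls are needed.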
The VideoAvatars provide high quality, easily recognizable people for virtual environments. Because the people are generally static when they are recorded, they appear rather statue-esque, but provide significant information about where other users are located in the environment and where they are looking.
Since the images are static, they cannot take advantage of some of the tracking information, which could otherwise be used to communicate through simple gestures such as nodding and pointing. Some life can be added to the avatars by having the person move while being recorded. However, this can become problematic: the disparity between the two images used can often be too great for fusion to take place. It also means that the avatar's apparent pose within the model depends on the angle from which it is viewed. When networked audio is employed, the static images also fail to provide any indication of who is talking.
Despite the limitations of using image-based rather than model-based rendering techniques for the VideoAvatars, the results are stunningly realistic. The goal is to supply a range of avatars with varying capabilities. Future versions of the library will incorporate multiple positions within the avatar model and real-time live video in order to fulfill this goal.
The authors wish to acknowledge Dave Pape, Javier I. Girado, and Terry Franguiadakis for their valuable input to this research.