January 21st, 2003
This paper describes a cost-effective, real-time (640x480 at 30Hz) upright frontal face detector, developed as part of an ongoing project to build a video-based, tetherless 3D head position and orientation tracking system. The work specifically targets auto-stereoscopic displays and projection-based virtual reality systems. The proposed face detector is based on a modified LAMSTAR neural network system. At the input stage, after image normalization and equalization, a sub-window is analyzed for facial features using a neural network. The sub-window is segmented, and each segment is fed to a neural network layer consisting of a Kohonen Self-Organizing Map (SOM). The outputs of the SOM neural networks are interconnected and related by correlation links, and can hence determine the presence of a face with enough redundancy to provide a high detection rate. To avoid tracking multiple faces simultaneously, the system is initially trained to track only the face centered in a box superimposed on the display. The system is also invariant to rotation and size to a certain degree.
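The segment-and-vote idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the horizontal-strip segmentation, the function names (`segment`, `winner`, `face_score`), the toy weights, and the use of a single link weight per SOM neuron as a stand-in for the correlation links are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def segment(window: List[List[float]], parts: int) -> List[List[float]]:
    """Split a sub-window into `parts` horizontal strips, each flattened to a vector.

    Assumption: the paper does not specify the segmentation used here.
    """
    rows_per_part = len(window) // parts
    return [
        [p for row in window[i * rows_per_part:(i + 1) * rows_per_part] for p in row]
        for i in range(parts)
    ]


def winner(som: List[Vector], x: Vector) -> int:
    """Index of the SOM neuron whose weight vector is closest to x (squared Euclidean)."""
    dists = [sum((w - v) ** 2 for w, v in zip(neuron, x)) for neuron in som]
    return dists.index(min(dists))


def face_score(window: List[List[float]],
               soms: List[List[Vector]],
               link_weights: List[List[float]]) -> float:
    """Sum the (hypothetical) link weight attached to each module's winning neuron."""
    segments = segment(window, len(soms))
    return sum(
        link_weights[m][winner(soms[m], seg)]
        for m, seg in enumerate(segments)
    )


# Toy example: a 4x2 sub-window split into two strips, one small SOM per strip.
window = [[0.9, 0.9], [0.8, 0.8], [0.1, 0.1], [0.2, 0.2]]
som = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]  # two neurons, made-up weights
soms = [som, som]
link_weights = [[-1.0, 1.0], [0.5, -0.5]]            # made-up votes per neuron

score = face_score(window, soms, link_weights)
is_face = score > 0.0  # the threshold is an arbitrary choice for this sketch
```

Because every segment contributes its own vote, a poor match in one segment can be outweighed by the others, which is one way to read the redundancy mentioned above.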
The Electronic Visualization Laboratory (EVL) is one of several research groups working on producing PC-driven, projection-based virtual reality (VR) displays. The use of high-end systems, such as EVL’s CAVE and ImmersaDesk, is well established in application areas such as computational science, automotive engineering and chemical exploration. The next generation of VR displays, both tiled LCD and projection-based, aims to eliminate encumbrances on the user. The trend is towards higher-resolution displays where the user is not required to wear special glasses to view stereoscopic scenes. Analogously, the trend for interaction with these displays is towards lightweight and tetherless input devices.
EVL and its collaborators are exploring the use of other modalities (such as vision, speech and gesture) as human-computer interfaces for this new generation of VR systems. Gesture recognition can come from either tracking the user’s movements or processing them using video camera input. Gaze direction, or eye tracking, using camera input is also possible. Audio support can be used for voice recognition and generation, as well as in conjunction with recording tele-immersive sessions. Used together, these systems enable tetherless tracking and unencumbered hand movements for improved interaction and collaboration within the virtual scene.
The research described here concerns a cost-effective, real-time (640x480 at 30Hz) face detector that will serve as the core of a video-based, tetherless 3D head position and orientation tracking system targeted for either auto-stereoscopic displays or projection-based virtual reality systems. It will be tested first using EVL’s Access Grid Augmented / Autostereo Virtual Environment (AGAVE), a high-resolution auto-stereoscopic display consisting of tiled LCD displays driven by a PC cluster and fitted with a highly sensitive tracking system that tracks the user’s gaze and gestures without the use of head-mounted or hand-held tracking devices.
The complete tracking system will consist of two neural network-based face detectors working in parallel, each running through four distinct phases. They are: pre-processing, which includes input masking, image normalization, histogram equalization, and image sub-sampling; detection, where a modified LAMSTAR neural network takes the pre-processed image, scans for a face and outputs the coordinates of the corresponding box surrounding the face; post-processing, where facial feature extraction and stereo correspondence matching occur to extract the 3D information; and prediction, implemented as a module based on a neural network linear filter with a Tap Delay Line (TDL). The function of the prediction module is to inform the face detection modules where in the scene the face will likely be found, so as to avoid scanning the whole of the next frame for a face. If the face is not detected in the predicted position, the system will rescan the entire scene. Phases one and three use computer vision techniques. This paper addresses the pre-processing and detection phases.
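As a rough illustration of the pre-processing phase, the Python sketch below chains three of the steps named above (normalization, histogram equalization, sub-sampling) on a tiny grayscale image. It makes several assumptions that the text does not confirm: min-max stretching is used for normalization, sub-sampling is plain decimation without filtering, input masking is omitted, and all function names are hypothetical.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, pixel values in 0..255


def normalize(img: Image) -> Image:
    """Stretch intensities so the darkest pixel maps to 0 and the brightest to 255."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [[0 for _ in row] for row in img]
    scale = 255.0 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in img]


def equalize_histogram(img: Image) -> Image:
    """Histogram equalization via the cumulative distribution function (CDF)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:  # only one gray level present
        return [row[:] for row in img]
    # Lookup table mapping each input level to its equalized level.
    lut = [max(0, round((c - cdf_min) * 255 / (total - cdf_min))) for c in cdf]
    return [[lut[p] for p in row] for row in img]


def subsample(img: Image, factor: int) -> Image:
    """Keep every `factor`-th pixel in both directions (no anti-alias filtering)."""
    return [row[::factor] for row in img[::factor]]


def preprocess(img: Image, factor: int = 2) -> Image:
    """Normalize, equalize, then sub-sample, in the order listed in the text."""
    return subsample(equalize_histogram(normalize(img)), factor)


# Toy 4x4 frame with a narrow intensity range (50..120).
frame = [
    [50, 60, 70, 80],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [90, 100, 110, 120],
]
small = preprocess(frame, factor=2)  # 2x2 image spanning a wider intensity range
```

A real-time system at 640x480 and 30Hz would need an optimized image library rather than nested Python lists; the sketch only shows the order and effect of the steps.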
Girado, J., Sandin, D., DeFanti, T., Wolf, L., Real-time Camera-based Face Detection using a Modified LAMSTAR Neural Network System, Proceedings of IS&T / SPIE’s 15th Annual Symposium Electronic Imaging 2003, Applications of Artificial Neural Networks in Image Processing V, San Jose, California, pp. 20-24, January 21st, 2003.