CCSS: Modeling and Learning Space-time Structures for Real-time Environment Perception

August 1st, 2024 - July 31st, 2027

Categories: Software, Deep Learning, Machine Learning, Data Science, Computer Vision, Image Processing

About

Developing sensing circuits and systems capable of perceiving the physical environment is an essential step toward next-generation cyber-physical systems that leverage computation, communication, and sensing integrated with the physical domain. The recent introduction of a class of artificial intelligence (AI) methods known as deep learning has drastically advanced vision-based environment perception. However, the applicability of these methods is significantly limited by their reliance on vast amounts of 3D annotations for training, which are expensive, unnatural, and often unobtainable. Acquiring 3D annotations necessitates specialized data captured by 3D sensors rather than the abundant, readily available images and videos. Moreover, it requires either a controlled environment or a compromise with inaccurate, subjective labeling. This project aims to address real-time environment perception in a very challenging, largely unexplored, yet highly practical setting, where only 2D annotations are available during training, by modeling and learning rich space-time structures of the environment. The outcomes of this project will significantly impact cyber-physical systems and facilitate a wide range of applications, from robots in manufacturing and personal services to autonomous vehicles that enhance people’s mobility and safety. Furthermore, this project will tightly integrate research and education at the University of Illinois Chicago (UIC), a Minority Serving Institution (MSI), through curriculum development; research training for high school, undergraduate, and graduate students; broadened participation of female and minority students; and community outreach.

This project will address real-time environment perception when only 2D annotations are available during training by systematically pursuing a novel approach that decomposes the complex physical environment into geometry and motion substructures in 3D space and models their rich space-time interactions. From streaming video, the trained system will not only recognize objects and scene layouts in the environment but also estimate their 3D geometry and 3D motion in real time. The project focuses on (1) establishing a self-supervised framework for lifting 2D objects to 3D by bridging geometry and motion, brightness constancy, and differentiable depth rendering; (2) scaling 2D-to-3D lifting from individual objects to perception of the entire environment by modeling and learning its short-term and long-term dynamics; and (3) achieving flexible and generalizable environment perception by handling articulated motion and out-of-distribution environments. The new approach is expected to be efficient in the amount of annotation required for training, scalable to a wide range of objects, and robust to the complexity and diversity of real-world environments. This project will advance and enrich fundamental research in visual sensing, perception, and learning. Moreover, it will demonstrate the usability and robustness of the proposed approach in real cyber-physical systems.
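To make the first thrust concrete, the sketch below illustrates the kind of self-supervised signal that brightness constancy and differentiable depth rendering can provide: a source video frame is warped into the target view using a predicted depth map, camera intrinsics, and a relative camera pose, and the photometric difference between the target frame and the warped frame serves as a training loss that requires no 3D annotations. This is a minimal, generic PyTorch sketch under standard pinhole-camera and rigid-motion assumptions; the function names (backproject, warp_to_target, photometric_loss) and the plain L1 penalty are illustrative choices, not the project’s actual formulation.

```python
# Minimal sketch (PyTorch) of a brightness-constancy loss driven by differentiable
# depth-based warping. Shape conventions: images (B, C, H, W), depth (B, 1, H, W),
# intrinsics K (B, 3, 3), relative pose T_tgt_to_src (B, 4, 4).
import torch
import torch.nn.functional as F


def backproject(depth, K_inv):
    """Lift every pixel of a depth map to a 3D point in the target camera frame."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    rays = K_inv @ pix                       # (B, 3, H*W) viewing rays
    return rays * depth.reshape(B, 1, -1)    # scale rays by predicted depth


def warp_to_target(src_img, tgt_depth, T_tgt_to_src, K, K_inv):
    """Differentiably sample the source frame at pixels implied by depth + relative pose."""
    B, _, H, W = src_img.shape
    pts = backproject(tgt_depth, K_inv)                              # 3D points, target frame
    pts = T_tgt_to_src[:, :3, :3] @ pts + T_tgt_to_src[:, :3, 3:]    # move into source frame
    proj = K @ pts                                                   # project with intrinsics
    z = proj[:, 2].clamp(min=1e-6)
    grid_x = 2.0 * (proj[:, 0] / z) / (W - 1) - 1.0                  # normalize to [-1, 1]
    grid_y = 2.0 * (proj[:, 1] / z) / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)


def photometric_loss(tgt_img, src_img, tgt_depth, T_tgt_to_src, K):
    """Brightness constancy: the warped source frame should match the target frame."""
    warped = warp_to_target(src_img, tgt_depth, T_tgt_to_src, K, torch.inverse(K))
    return (tgt_img - warped).abs().mean()


if __name__ == "__main__":
    # Smoke test with random tensors (illustrative only).
    B, H, W = 2, 64, 80
    tgt, src = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)
    depth = torch.rand(B, 1, H, W) + 0.5
    K = torch.tensor([[100.0, 0.0, W / 2],
                      [0.0, 100.0, H / 2],
                      [0.0, 0.0, 1.0]]).repeat(B, 1, 1)
    T = torch.eye(4).repeat(B, 1, 1)
    print(photometric_loss(tgt, src, depth, T, K).item())
```

Because every step of the warp (back-projection, rigid transform, projection, bilinear sampling) is differentiable, a loss of this kind can supervise depth and motion predictors directly from 2D video, which is the basic reason brightness constancy and differentiable depth rendering can substitute for 3D annotations.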