My name is Shi Boao; my English name is Bowie. I am currently a research assistant at the Penn GRASP Lab and a final-year undergraduate majoring in Computer Science at The University of Hong Kong.
My research interests lie in general 3D/4D perception, physical intelligence, and robotics. I want to develop scalable perceptual and physical world-modelling systems that learn directly from large-scale unconstrained video, arguably the most abundant and accessible form of real-world data. Building on such systems, I aim to enable robot decision-making training through simulation at scale.
Some directions and topics I am interested in:
- Perception foundation models (visual geometry models, feed-forward 3D/4D reconstruction)
- 4D foundation models (action-conditioned world models)
- Robot learning
My current research goal is to answer the following questions:
- How can we learn the dynamics and interaction causality of the real world from large-scale streaming monocular observations?
- How can we reason about the underlying intrinsic or physical properties of unstructured entity representations (pixels, point clouds, tracklets, etc.)?
- How can we extract scalable and generalizable priors over world dynamics and interaction causality from videos of human interaction and decision making, and how can we exploit them as a foundation for robotics tasks?
