
UnO: Unsupervised Occupancy Fields for Perception and Forecasting
Ben Agro, Quinlan Sykora, Sergio Casas, Thomas Gilles, Raquel Urtasun
CVPR 2024 (Oral)
Scalable architectures and algorithms (e.g., transformers)
Large and diverse datasets to feed to those algorithms (e.g., data scraped from the internet)
Accelerated, industrial-scale computing to train these models.
Capture the 3D geometry of the world
Understand dynamics and be able to forecast future states of the world
Generalize to rare and safety-critical situations
Be able to run in real-time to facilitate real-world applications
Be transferable to a wide variety of downstream tasks like object detection and trajectory forecasting
LiDAR data provides explicit 3D geometric information, and as such it presents a promising avenue for building foundation models of the physical world. However, prior works like 4D-Occ struggle with forecasting and generalization, while more performant methods like our Copilot4D are too slow for real-time inference; neither demonstrates adaptability to downstream tasks. As LiDAR is the primary sensing modality for Level 4 self-driving vehicles, in this work we focus on building a LiDAR foundation model that meets the five criteria above.
Modeling the physical world through LiDAR data brings challenges not encountered with language. First, while language is low-dimensional and naturally represented with discrete tokens, sensor data is both high-dimensional and continuous in space and time.
As such, we cannot apply the recipes from LLMs directly to our task. Second, while text generation is directly useful for a variety of applications, it is not immediately clear how a foundation model that predicts sensor data can enhance a self-driving system.
Alternatively, we propose to model the world with a spatio-temporal occupancy field, which specifies the probability that a particular location in space and future time is occupied, as we formalize below.
Paralleling classification over language tokens, occupancy allows us to learn a classification task instead of regressing point clouds directly.
Occupancy abstracts away the specifics of the LiDAR sensors and material properties that are hard to learn.
Occupancy is employed by many algorithms in robotics and autonomous driving, making it directly useful to existing stacks.
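Concretely, the occupancy field can be thought of as a function from continuous space-time queries to occupancy probabilities. One way to write this down (our notation here, with the conditioning on past LiDAR made precise in the next section) is:
\[
o_\theta : \mathbb{R}^3 \times \mathbb{R} \to [0, 1], \qquad (x, y, z, t) \mapsto P\big(\text{location } (x, y, z) \text{ is occupied at time } t \mid \text{past LiDAR}\big).
\]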
In this section we discuss how to learn 4D occupancy fields from LiDAR data. Our unsupervised task leverages the occupancy information provided by future LiDAR data. We assume we know the instantaneous position of the LiDAR sensor, labelled \(\color{blue}s_i\) below.
Tracing a ray from the LiDAR position to any LiDAR point \(\color{magenta}p_{ij}\) tells us which parts of the scene must be unoccupied \(\color{red}\mathcal{R}_{ij}^{-}\), since otherwise the laser would have been reflected earlier.
We also assume that there must be a small region of occupied space \(\color{green}\mathcal{R}_{ij}^{+}\) immediately beyond the LiDAR point, corresponding to the surface that reflected the laser. Repeating this for many LiDAR rays describes the occupancy of the visible parts of the scene, and because our datasets contain future sensor information, we can supervise forecasting as well. We train the occupancy model with binary classification over these regions in space and future time, sampling query points equally from the positive and negative regions during training. This task leverages the ability of neural networks to interpolate and generalize to unseen areas.
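As an illustration of how this supervision can be generated, here is a minimal sketch of turning a single LiDAR ray into labelled query points; the function name, margins, and sample counts are our own simplifications, not the exact implementation:

```python
# Illustrative sketch: deriving occupancy query points and labels from one
# LiDAR ray. Points between the sensor origin and the return are free space
# (label 0); a small region just beyond the return is occupied (label 1).
import numpy as np

def ray_to_queries(s, p, t, num_neg=8, pos_margin=0.2):
    """s, p: (3,) sensor origin and LiDAR point; t: timestamp of the ray."""
    direction = (p - s) / np.linalg.norm(p - s)
    # Negative queries R^-: sampled along the unoccupied segment from s towards p.
    fracs = np.random.uniform(0.0, 0.98, size=(num_neg, 1))
    neg = s + fracs * (p - s)
    # Positive query R^+: sampled within a small slab right behind the return.
    pos = p + np.random.uniform(0.0, pos_margin) * direction
    points = np.vstack([neg, pos[None]])
    labels = np.concatenate([np.zeros(num_neg), np.ones(1)])
    times = np.full((len(points), 1), t)
    return np.hstack([points, times]), labels  # (N, 4) queries (x, y, z, t), (N,) labels
```

During training, batches are formed by applying this to many rays from future sweeps and drawing an equal number of positive and negative queries, as described above.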
The overall training procedure is pictured above: the model takes past LiDAR sweeps as input to forecast a continuous spatio-temporal occupancy field into the future, which is supervised using the information from future LiDAR sweeps.
Can output an occupancy probability at any continuous point in space and time. This is because each LiDAR ray is produced at a different continuous time.
Can efficiently represent large 4D scenes
Has a large receptive field to capture actor dynamics across large spatial areas, such as fast-moving vehicles on a highway.
This architecture is based on our prior work ImplicitO, and we refer the interested reader there for further intuition and details. At a high level, we voxelize the past LiDAR and pass it through a LiDAR encoder to produce a bird's-eye view (BEV) feature map \(Z\), which encodes geometric, dynamic, and semantic features of the input data. This feature map, along with a batch of query points, is fed into the implicit decoder. The decoder uses a deformable attention mechanism to increase its receptive field, which aids in learning dynamics, and it runs on all query points in parallel for efficiency.
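For readers who prefer code, below is a minimal PyTorch sketch in the spirit of this decoder; the feature dimensions, offset parameterization, and module names are illustrative simplifications rather than the released implementation.

```python
# Minimal sketch of an implicit occupancy decoder that attends to a BEV
# feature map Z at a query location plus learned deformable offsets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitOccupancyDecoder(nn.Module):
    def __init__(self, feat_dim=128, num_offsets=4, hidden=256):
        super().__init__()
        # Predicts (x, y) offsets so features can be gathered beyond the query
        # location, enlarging the receptive field for fast-moving actors.
        self.offset_mlp = nn.Sequential(
            nn.Linear(feat_dim + 4, hidden), nn.ReLU(),
            nn.Linear(hidden, num_offsets * 2),
        )
        self.occ_mlp = nn.Sequential(
            nn.Linear(feat_dim * (1 + num_offsets) + 4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # occupancy logit per query
        )

    def sample(self, Z, xy_norm):
        # Z: (B, C, H, W) BEV features; xy_norm: (B, N, 2) in [-1, 1].
        feats = F.grid_sample(Z, xy_norm.unsqueeze(2), align_corners=False)
        return feats.squeeze(-1).permute(0, 2, 1)  # (B, N, C)

    def forward(self, Z, queries, extent=50.0):
        # queries: (B, N, 4) continuous (x, y, z, t) points in the ego frame.
        xy_norm = queries[..., :2] / extent          # normalize to grid coords
        base = self.sample(Z, xy_norm)               # features at the query
        offsets = self.offset_mlp(torch.cat([base, queries], dim=-1))
        offsets = offsets.view(*queries.shape[:2], -1, 2) / extent
        gathered = [base]
        for k in range(offsets.shape[2]):
            gathered.append(self.sample(Z, xy_norm + offsets[:, :, k]))
        feats = torch.cat(gathered + [queries], dim=-1)
        return self.occ_mlp(feats).squeeze(-1)       # (B, N) occupancy logits
```

Because the encoder runs once per scene and the decoder processes a large batch of (x, y, z, t) queries in parallel against the same feature map, dense 4D occupancy can be decoded efficiently.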
Below we visualize UnO’s occupancy at the present time, unrolled across an entire sequence from the urban driving dataset Argoverse 2. We show two views: the perspective view has occupancy coloured by z in the ego frame, while the first person view has occupancy coloured by depth. The camera images are provided for reference, and not used as input.
We observe that UnO captures all objects in the scene, including vulnerable road users like the cyclist. In the highway driving scenario, collected from one of our trucks, UnO perceives a couch on the highway, an object it never saw during training and an impressive example of generalization. Note that these two examples span:
different vehicle platforms with different sensor setups (a passenger car in the Argoverse dataset, an autonomous truck in the highway dataset)
different driving domains: urban vs. highway
Below we visualize 4D occupancy forecasts of UnO and 4D-Occ. Unlike previous visualizations, here we show what UnO forecasts into the future. The ground truth future point cloud is provided for reference.
We see that UnO captures dynamic actors like the turning vehicle, while 4D-Occ struggles to generalize beyond the input point cloud to 4D occupancy and predicts all actors to be static. In this example, UnO accurately predicts the lane change of a vehicle.
Depth rendering: given a LiDAR ray, the model predicts occupancy values along that ray, and those occupancy values are trained with a NeRF-like depth loss against the ground-truth LiDAR point depth.
Free space rendering: given a LiDAR ray, the model predicts occupancy values along that ray; we take the cumulative maximum of those occupancies along the ray and supervise it with binary cross-entropy against the ground-truth visibility information provided by the LiDAR point observation (a minimal sketch of both rendering losses follows this list).
Unbalanced UnO: we use the same objective as UnO, but do not balance the positive and negative query points used during training.
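For concreteness, here is a minimal sketch of the two rendering-based objectives; the discretization and function names are our own assumptions for the example, not the exact baseline implementations. Occupancies o of shape (R, K) are assumed to be probabilities sampled at depths d along R rays, with ground-truth return depth gt.

```python
# Sketch of the depth-rendering and free-space-rendering pre-training losses.
import torch
import torch.nn.functional as F

def depth_rendering_loss(o, d, gt):
    # NeRF-like compositing: weight each sample by the probability the ray
    # travels unoccluded up to it and stops there, then render expected depth.
    trans = torch.cumprod(1.0 - o + 1e-6, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    w = o * trans
    depth = (w * d).sum(-1) / (w.sum(-1) + 1e-6)
    return F.l1_loss(depth, gt)

def free_space_rendering_loss(o, d, gt):
    # The cumulative-max occupancy along the ray should stay 0 before the
    # LiDAR return and reach 1 at or beyond it (the observed visibility).
    cummax_o = torch.cummax(o, dim=-1).values
    target = (d >= gt.unsqueeze(-1)).float()
    return F.binary_cross_entropy(cummax_o.clamp(1e-6, 1 - 1e-6), target)
```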
UnO outperforms these alternative occupancy-based pre-training procedures, achieving much higher geometric occupancy recall across all actor classes.
This is supported by quantitative and qualitative comparisons, which show that
Depth rendering hallucinates some objects, struggles with object extent, and does not capture the motion of vehicles
Free-space rendering has poor motion predictions
Unbalanced UnO is under-confident, even on static background areas, and suffers from disappearing occupancy for moving actors.
The model needs to either predict or encode the future sensor location in order to perform raytracing.
The model needs to memorize the intrinsics specific to the LiDAR sensor.
The model needs to understand the reflectance and other material properties of objects in the scene.
The learned renderer takes a query ray as input and queries UnO for occupancy values along that ray. These occupancy values, along with positional encodings, are fed into a small MLP that regresses the LiDAR point depth along the ray, supervised with an \(\ell_1\) loss against the ground-truth depth. By querying this learned renderer with the intrinsics of a known LiDAR sensor, we can transfer UnO to the task of point cloud forecasting.
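As a concrete illustration, here is a minimal sketch of such a learned renderer; the number of samples, encoding frequencies, and module names are assumptions for the example, not the released implementation.

```python
# Sketch: occupancy samples along a query ray, plus positional encodings of the
# sample depths, are mapped by a small MLP to a single depth per ray.
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=8):
    # Standard sinusoidal encoding of scalar depths along the ray.
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    ang = x.unsqueeze(-1) * freqs
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

class LearnedRenderer(nn.Module):
    def __init__(self, num_samples=64, num_freqs=8, hidden=256):
        super().__init__()
        in_dim = num_samples * (1 + 2 * num_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # regressed depth along the ray
        )

    def forward(self, occ, depths):
        # occ, depths: (R, num_samples) occupancy values and sample depths
        # obtained by querying the frozen UnO model along each ray.
        feats = torch.cat([occ.unsqueeze(-1), positional_encoding(depths)], dim=-1)
        return self.mlp(feats.flatten(1)).squeeze(-1)  # (R,) predicted depths

# Training would minimize an L1 loss between predicted and ground-truth depth:
# loss = torch.nn.functional.l1_loss(renderer(occ, depths), gt_depth)
```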
Quantitatively, UnO significantly outperforms contemporary point cloud forecasting methods across a diverse range of datasets, and it emerged as the best model in the CVPR 2024 Argoverse 2 LiDAR forecasting challenge.
Compared to the state of the art, UnO more accurately predicts point clouds on moving objects, like the turning vehicle in this example:
To transfer UnO to the task of semantic BEV occupancy forecasting, we start from UnO's pre-trained weights. Because the occupancy targets lie in the x-y plane, we replace the z coordinate of the query points with a learned value, and we fine-tune the model to forecast semantic occupancy labels with a binary cross-entropy loss.
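A minimal sketch of this fine-tuning step is shown below, reusing the (Z, queries) interface of the decoder sketch above; the class and parameter names are illustrative assumptions.

```python
# Sketch: every query's z coordinate is replaced by a single learned value and
# the pre-trained UnO decoder is fine-tuned with BCE on semantic BEV labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBEVOccupancy(nn.Module):
    def __init__(self, pretrained_uno: nn.Module):
        super().__init__()
        self.uno = pretrained_uno                      # initialized from UnO weights
        self.learned_z = nn.Parameter(torch.zeros(1))  # replaces the query z

    def forward(self, Z, queries_xyt):
        # queries_xyt: (B, N, 3) BEV query points (x, y, t).
        z = self.learned_z.expand(*queries_xyt.shape[:2], 1)
        queries = torch.cat([queries_xyt[..., :2], z, queries_xyt[..., 2:]], dim=-1)
        return self.uno(Z, queries)                    # (B, N) semantic occupancy logits

# Fine-tuning objective on labelled BEV occupancy:
# loss = F.binary_cross_entropy_with_logits(model(Z, queries_xyt), labels)
```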
In the graph below, we plot semantic occupancy forecasting accuracy as a function of the number of available labeled training examples. We find that UnO's unsupervised pre-training allows it to outperform the contemporary methods ImplicitO and MP3, which are trained only on labeled occupancy data, even when UnO uses an order of magnitude less labeled data. At all levels of supervision, UnO brings a significant boost in performance. This is because UnO's pre-training supervises all areas of the scene, including the map topology and interactions with non-vehicle actors, which helps it better forecast vehicle dynamics. Additionally, for each of the alternative pre-training objectives discussed earlier, we fine-tuned the model using the same procedure as for UnO; UnO's pre-training provides the best downstream semantic occupancy forecasting performance.
Compared to contemporary approaches to BEV semantic occupancy forecasting, UnO is particularly better at forecasting occupancy for relatively rare occurrences, like the large articulated vehicle.
We note that ImplicitO uses the same architecture as UnO, but without the unsupervised pre-training. UnO's pre-training allows it to understand the road topology and how vehicles move with respect to it. While ImplicitO, our previous state-of-the-art BEV occupancy model, is uncertain about the intent of the turning actor, UnO correctly forecasts the turn.
is adaptable to various downstream tasks, like LiDAR forecasting and semantic occupancy forecasting,
can generalize to new scenarios and vehicle platforms,
and has an efficient architecture capable of real-time inference on the edge.
@inproceedings{agro2024uno,
title = {UnO: Unsupervised Occupancy Fields for Perception and Forecasting},
author = {Agro, Ben and Sykora, Quinlan and Casas, Sergio and Gilles, Thomas and Urtasun, Raquel},
booktitle = {CVPR},
year = {2024},
}