UnO: Unsupervised Occupancy Fields for Perception and Forecasting

CVPR 2024 (Oral)

Ben Agro, Quinlan Sykora, Sergio Casas, Thomas Gilles, Raquel Urtasun


Perceiving the world and forecasting its future state is a critical task for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world --- traditionally with object detections and trajectory predictions, or temporal bird's-eye-view (BEV) occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. This unsupervised world model can be easily and effectively transferred to downstream tasks. We tackle point cloud forecasting by adding a lightweight learned renderer and achieve state-of-the-art performance in Argoverse 2, nuScenes, and KITTI. To further showcase its transferability, we fine-tune our model for BEV semantic occupancy forecasting and show that it outperforms the fully supervised state-of-the-art, especially when labelled data is scarce. Finally, when compared to prior state-of-the-art on spatio-temporal geometric occupancy prediction, our 4D world model achieves a much higher recall of objects from classes relevant to self-driving.

Video


Foundation Models for the Real World

Motivation

The success of modern foundation models has been driven by three key ingredients:

  1. Scalable architectures and algorithms (e.g., transformers)

  2. Large and diverse datasets to feed to those algorithms (e.g., data scraped from the internet)

  3. Accelerated and industrial-scale computing to train these models

To be useful for self-driving, a foundation model of the physical world must:

  1. Capture the 3D geometry of the world

  2. Understand dynamics and be able to forecast future states of the world

  3. Generalize to rare and safety critical situations

  4. Be able to run in real-time to facilitate real-world applications

  5. Be transferable to a wide variety of downstream tasks like object detection and trajectory forecasting

LiDAR data provides explicit 3D geometric information and as such presents a promising avenue for building foundation models of the physical world. However, prior works like 4D-Occ struggle with forecasting and generalization, while more performant methods like our Copilot4D are too slow for real-time inference; neither demonstrates adaptability to downstream tasks. As LiDAR is the primary sensing modality for Level 4 self-driving vehicles, in this work we focus on building a LiDAR foundation model that meets the five criteria above.

Challenges

Modeling the physical world through LiDAR data brings challenges not encountered with language. First, while language is low-dimensional and naturally represented with discrete tokens, sensor data is both high-dimensional and continuous in space and time:

As such, we cannot apply the recipes from LLMs directly to our task. Second, while text generation is directly useful for a variety of applications, it is not immediately clear how to use a foundation model that predicts sensor data to enhance a self-driving system:

Idea

Instead, we propose to model the world with a spatio-temporal occupancy field, which specifies the probability that a particular location in space and future time is occupied.

  • Paralleling classification over language tokens, occupancy allows us to learn a classification task instead of regressing point clouds directly.

  • Occupancy abstracts away the specifics of the LiDAR sensors and material properties that are hard to learn.

  • Occupancy is employed by many algorithms in robotics and autonomous driving, making it directly useful to existing stacks.


Method

Unsupervised Task

In this section we discuss how to learn 4D occupancy fields from LiDAR data. Our unsupervised task leverages the occupancy information provided by future LiDAR data. We assume we know the instantaneous position of the LiDAR sensor, denoted \(\color{blue}s_i\) below.

Tracing a ray from the LiDAR position to any LiDAR point \(\color{magenta}p_{ij}\) tells us which parts of the scene must be unoccupied \(\color{red}\mathcal{R}_{ij}^{-}\), since otherwise the return would have occurred earlier.

We also assume that there must be a small region of occupied space \(\color{green}\mathcal{R}_{ij}^{+}\) immediately after the LiDAR point, which reflects the laser. Repeating this for many LiDAR rays allows us to describe the occupancy of the visible parts of the scene, and we can supervise forecasting because future sensor data is available in our datasets. We train the occupancy model with binary classification over these regions in space and future time, sampling query points equally from positive and negative regions during training. This task leverages the ability of neural networks to interpolate and generalize to unseen areas.
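
One simple way to parameterize these two regions along ray \(j\) of sweep \(i\) (our illustration; the exact margin and parameterization are hyperparameters of the method, not something we reproduce verbatim here) is:

\[
\mathcal{R}_{ij}^{-} = \{\, s_i + \alpha\,(p_{ij} - s_i) : \alpha \in [0, 1) \,\}, \qquad
\mathcal{R}_{ij}^{+} = \{\, p_{ij} + \beta\,\hat{d}_{ij} : \beta \in [0, \epsilon] \,\},
\]

where \(\hat{d}_{ij} = (p_{ij} - s_i) / \lVert p_{ij} - s_i \rVert\) is the ray direction and \(\epsilon\) is a small positive margin.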

The overall training procedure is pictured above: the model takes past LiDAR sweeps as input to forecast a continuous spatio-temporal occupancy field into the future, which is supervised using the information from future LiDAR sweeps.
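
As a concrete but purely illustrative sketch, the query sampling and binary cross-entropy objective could look as follows in PyTorch; the function names, the model(past_sweeps, queries) interface, and the margin eps are our assumptions rather than the released implementation:

import torch
import torch.nn.functional as F

def sample_ray_queries(sensor_origin, points, times, n_per_ray=4, eps=0.1):
    """Sample balanced free/occupied query points (x, y, z, t) from one future sweep.

    sensor_origin: (3,) LiDAR position s_i for this sweep.
    points:        (N, 3) LiDAR points p_ij.
    times:         (N,) capture time of each point (each ray fires at a different time).
    """
    dirs = points - sensor_origin                      # ray vectors from s_i to p_ij
    unit = dirs / dirs.norm(dim=-1, keepdim=True)      # unit ray directions

    # Free space: anywhere strictly between the sensor and the observed return.
    alpha = torch.rand(points.shape[0], n_per_ray, 1)
    free_xyz = sensor_origin + alpha * dirs[:, None, :]

    # Occupied space: a small segment of length eps just past the return.
    beta = torch.rand(points.shape[0], n_per_ray, 1) * eps
    occ_xyz = points[:, None, :] + beta * unit[:, None, :]

    t = times[:, None, None].expand(-1, n_per_ray, -1)
    free_q = torch.cat([free_xyz, t], dim=-1).reshape(-1, 4)
    occ_q = torch.cat([occ_xyz, t], dim=-1).reshape(-1, 4)

    queries = torch.cat([free_q, occ_q], dim=0)
    labels = torch.cat([torch.zeros(len(free_q)), torch.ones(len(occ_q))])
    return queries, labels                             # equal numbers of + and - samples

def occupancy_loss(model, past_sweeps, sensor_origin, future_points, future_times):
    queries, labels = sample_ray_queries(sensor_origin, future_points, future_times)
    logits = model(past_sweeps, queries)               # one occupancy logit per query
    return F.binary_cross_entropy_with_logits(logits, labels)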

Architecture

We need an architecture that:

  1. Can output an occupancy probability at any continuous point in space and time. This is because each LiDAR ray is produced at a different continuous time.

  2. Can efficiently represent large 4D scenes

  3. Has a large receptive field to capture actor dynamics across large spatial areas, such as fast-moving vehicles on a highway.

This architecture is based on our prior work ImplicitO, and we refer the interested reader there for further intuition and details. At a high level, we voxelize the past LiDAR sweeps and pass them through a LiDAR encoder to produce a bird's-eye-view feature map Z, which encodes geometric, dynamic, and semantic features of the input data. This feature map, along with a batch of query points, is fed into the implicit decoder. The decoder uses a deformable attention mechanism to increase its receptive field, which aids in learning dynamics, and it runs on all query points in parallel for efficiency.
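
The sketch below captures this flow. It is a simplified stand-in for the real decoder: it gathers BEV features at a set of learned offsets around each query rather than implementing full deformable attention, and all module names, shapes, and sizes are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitOccupancyDecoder(nn.Module):
    """Decodes occupancy logits from a BEV feature map Z at continuous (x, y, z, t) queries."""

    def __init__(self, feat_dim=128, n_offsets=8, hidden=256):
        super().__init__()
        self.offsets = nn.Linear(feat_dim + 4, n_offsets * 2)    # per-query BEV offsets
        self.head = nn.Sequential(
            nn.Linear((n_offsets + 1) * feat_dim + 4, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                                 # occupancy logit
        )

    def sample(self, Z, xy):
        # Z: (1, C, H, W) BEV features from the LiDAR encoder; xy: (Q, 2) in [-1, 1].
        grid = xy.view(1, -1, 1, 2)
        feats = F.grid_sample(Z, grid, align_corners=False)       # (1, C, Q, 1)
        return feats[0, :, :, 0].t()                              # (Q, C)

    def forward(self, Z, queries):
        # queries: (Q, 4) continuous (x, y, z, t) points with xy already normalized.
        xy = queries[:, :2]
        center = self.sample(Z, xy)                               # features at each query
        offs = self.offsets(torch.cat([center, queries], dim=-1))
        offs = offs.view(len(queries), -1, 2)                     # (Q, K, 2) learned offsets
        gathered = [self.sample(Z, xy + offs[:, k]) for k in range(offs.shape[1])]
        feats = torch.cat([center, *gathered, queries], dim=-1)
        return self.head(feats).squeeze(-1)                       # (Q,) occupancy logits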


Experiments

Geometric Occupancy

Below we visualize UnO’s occupancy at the present time, unrolled across an entire sequence from the urban driving dataset Argoverse 2. We show two views: the perspective view colours occupancy by height (z) in the ego frame, while the first-person view colours occupancy by depth. The camera images are provided for reference and are not used as input.


We observe that UnO captures all objects in the scene, including vulnerable road users like the cyclist. In a highway driving scenario collected from one of our trucks, UnO is able to perceive a couch on the highway, an object it never saw during training: an impressive example of generalization.

Together, these examples show that UnO generalizes across:

  • different vehicle platforms with different sensor setups (a passenger car in the Argoverse dataset, an autonomous truck in the highway dataset)

  • different driving domains: urban vs. highway

Below we visualize 4D occupancy forecasts of UnO and 4D-Occ. Unlike previous visualizations, here we show what UnO forecasts into the future. The ground truth future point cloud is provided for reference.


We see that UnO captures dynamic actors like the turning vehicle, while 4D-Occ struggles to generalize from the input point clouds to 4D occupancy and predicts all actors to be static. In a second example, UnO accurately predicts the lane change of a vehicle.

To validate our unsupervised task, we compare UnO's pre-training against several alternative occupancy-based pre-training procedures:

  • Depth rendering: given a LiDAR ray, the model predicts occupancy values along that ray, and those occupancy values are trained with a NeRF-like depth loss against the ground-truth LiDAR point depth.

  • Free-space rendering: given a LiDAR ray, the model predicts occupancy values along that ray; we take the cumulative maximum of those occupancy values and supervise it with binary cross-entropy against the ground-truth visibility implied by the LiDAR return (a rough sketch follows this list).

  • Unbalanced UnO: we use the same objective as UnO, but do not balance the positive and negative query points used during training.
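
For concreteness, the free-space rendering loss referenced above could be computed as in the following sketch; the sampling layout and names are our assumptions, intended only to illustrate the cumulative-maximum idea:

import torch
import torch.nn.functional as F

def free_space_rendering_loss(occ_along_ray, hit_index):
    """occ_along_ray: (R, S) occupancy probabilities at S ordered samples per ray.
    hit_index:      (R,) index of the sample closest to the observed LiDAR return.

    Samples before the return are visible free space, so their cumulative maximum
    occupancy should stay near 0; at and beyond the return it should reach 1.
    """
    cum_occ, _ = torch.cummax(occ_along_ray, dim=-1)            # (R, S)
    sample_idx = torch.arange(occ_along_ray.shape[-1])          # (S,)
    visible_free = (sample_idx[None, :] < hit_index[:, None]).float()
    target = 1.0 - visible_free                                 # 0 before the hit, 1 after
    return F.binary_cross_entropy(cum_occ.clamp(1e-6, 1 - 1e-6), target)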

UnO outperforms these alternative occupancy-based pre-training procedures, achieving much higher geometric occupancy recall across all actor classes.

This is supported by quantitative and qualitative comparisons, which show that:

  • Depth rendering hallucinates some objects, struggles with object extent, and does not capture the motion of vehicles

  • Free-space rendering has poor motion predictions

  • Unbalanced UnO is under-confident, even on static background areas, and suffers from disappearing occupancy for moving actors.

LiDAR Point Cloud Forecasting

Point cloud forecasting asks a model to predict future LiDAR sweeps directly, which entangles world modeling with several sensor-specific nuisances:

  1. The model needs to either predict or encode the future sensor location in order to perform raytracing.

  2. The model needs to memorize the intrinsics specific to the LiDAR sensor.

  3. The model needs to understand the reflectance and other material properties of objects in the scene.

Instead of forecasting point clouds directly, we transfer UnO to this task by adding a lightweight learned renderer. The renderer takes a query ray as input and queries UnO for occupancy values along that ray. These occupancy values, along with positional encodings, are fed into a small MLP that regresses the depth of the LiDAR return along the ray, supervised with an \(\ell_1\) loss against the ground-truth depth. By querying this learned renderer with rays defined by the known LiDAR sensor intrinsics, we can forecast full point clouds.
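
A minimal sketch of what such a learned renderer could look like, reusing the hypothetical occupancy_field(Z, queries) decoder interface from the architecture sketch above; the number of ray samples, the simple normalized-depth positional encoding, and the MLP sizes are illustrative assumptions rather than the actual implementation:

import torch
import torch.nn as nn

class LearnedDepthRenderer(nn.Module):
    """Regresses the depth of a LiDAR return along a query ray from occupancy values."""

    def __init__(self, n_samples=64, max_depth=80.0, hidden=128):
        super().__init__()
        self.n_samples = n_samples
        self.max_depth = max_depth
        # Input: one occupancy value plus one depth encoding per ray sample.
        self.mlp = nn.Sequential(
            nn.Linear(n_samples * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, occupancy_field, Z, origin, direction, time):
        # origin, direction: (3,) ray origin and unit direction; time: scalar tensor.
        # Query occupancy at fixed depths along the ray at the requested time.
        depths = torch.linspace(0.0, self.max_depth, self.n_samples)           # (S,)
        xyz = origin[None, :] + depths[:, None] * direction[None, :]           # (S, 3)
        t = time.expand(self.n_samples, 1)
        occ = torch.sigmoid(occupancy_field(Z, torch.cat([xyz, t], dim=-1)))   # (S,)
        feats = torch.cat([occ, depths / self.max_depth], dim=0)               # occupancy + crude depth encoding
        return self.mlp(feats) * self.max_depth                                # predicted return depth

def depth_loss(pred_depth, gt_depth):
    return (pred_depth - gt_depth).abs().mean()     # l1 loss against the ground-truth depth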

Quantitatively, UnO significantly outperforms contemporary point cloud forecasting methods across a diverse range of datasets, and it emerged as the best model in the CVPR 2024 Argoverse 2 LiDAR forecasting challenge.

Compared to the state of the art, UnO more accurately predicts point clouds on moving objects, like the turning vehicle in this example:


Semantic Bird’s Eye View Occupancy Forecasting

To transfer UnO to the task of semantic BEV occupancy forecasting, we start from UnO's pre-trained weights. Because the occupancy targets lie in the x-y plane, we replace the z coordinate of the query points with a learned value, and fine-tune the model to forecast semantic occupancy labels with a binary cross-entropy loss.
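
A rough sketch of this fine-tuning step, again reusing the hypothetical decoder interface from the architecture sketch; the learned z value and the binary cross-entropy loss follow the description above, while the names and shapes are our own:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBEVHead(nn.Module):
    """Fine-tunes a pre-trained UnO-style decoder for BEV semantic occupancy forecasting."""

    def __init__(self, pretrained_decoder):
        super().__init__()
        self.decoder = pretrained_decoder              # initialized from UnO's weights
        self.learned_z = nn.Parameter(torch.zeros(1))  # replaces z in the query points

    def forward(self, Z, xy_t):
        # xy_t: (Q, 3) BEV query locations and future times; z comes from the learned value.
        z = self.learned_z.expand(len(xy_t), 1)
        queries = torch.cat([xy_t[:, :2], z, xy_t[:, 2:]], dim=-1)   # (Q, 4) = (x, y, z, t)
        return self.decoder(Z, queries)                              # semantic occupancy logits

def finetune_loss(logits, labels):
    return F.binary_cross_entropy_with_logits(logits, labels)       # labels from BEV annotations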

In the graph below, we plot semantic occupancy forecasting performance as a function of the number of available labeled training examples. We find that UnO's unsupervised pre-training allows it to outperform the contemporary methods ImplicitO and MP3, which are trained only on labeled occupancy data, even when UnO is fine-tuned with an order of magnitude less labeled data. At all levels of supervision, UnO brings a significant boost in performance. This is because UnO's pre-training supervises all areas of the scene, including map topology and interactions with non-vehicle actors, which allows it to better forecast vehicle dynamics. Additionally, for each alternative pre-training objective discussed earlier, we fine-tuned the model using the same procedure as for UnO; UnO's pre-training provides the best downstream semantic occupancy forecasting performance.

Compared to contemporary approaches to BEV semantic occupancy forecasting, UnO is particularly good at forecasting occupancy for relatively rare occurrences, like the large articulated vehicle.


We note that ImplicitO uses the same architecture as UnO, but without the unsupervised pre-training. UnO's pre-training allows it to understand the road topology and how vehicles move with respect to it. While ImplicitO, our previous state-of-the-art BEV occupancy model, is uncertain about the intent of the turning actor, UnO correctly forecasts the turn.


Conclusion

To summarize, UnO is a 4D occupancy world model learned with self-supervision from LiDAR data that:

  • is adaptable to various downstream tasks, like LiDAR forecasting and semantic occupancy forecasting,

  • can generalize to new scenarios and vehicle platforms,

  • and has an efficient architecture capable of real-time inference on the edge, satisfying the criteria we laid out for a foundation model of the physical world.


BibTeX

@inproceedings{agro2024uno,
    title     = {UnO: Unsupervised Occupancy Fields for Perception and Forecasting},
    author    = {Agro, Ben and Sykora, Quin and Casas, Sergio and Gilles, Thomas and Urtasun, Raquel},
    booktitle = {CVPR},
    year      = {2024},
    }