UnO: Unsupervised Occupancy Fields for Perception and Forecasting

CVPR 2024 (Oral)

Ben Agro, Quinlan Sykora, Sergio Casas, Thomas Gilles, Raquel Urtasun


Perceiving the world and forecasting its future state is a critical task for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world --- traditionally with object detections and trajectory predictions, or temporal bird's-eye-view (BEV) occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. This unsupervised world model can be easily and effectively transferred to downstream tasks. We tackle point cloud forecasting by adding a lightweight learned renderer and achieve state-of-the-art performance in Argoverse 2, nuScenes, and KITTI. To further showcase its transferability, we fine-tune our model for BEV semantic occupancy forecasting and show that it outperforms the fully supervised state-of-the-art, especially when labelled data is scarce. Finally, when compared to prior state-of-the-art on spatio-temporal geometric occupancy prediction, our 4D world model achieves a much higher recall of objects from classes relevant to self-driving.

Video


Foundation Models for the Real World

Motivation

The success of modern foundation models has been driven by three key ingredients:

  1. Scalable architectures and algorithms (e.g., transformers)

  2. Large and diverse datasets to feed to those algorithms (e.g., data scraped from the internet)

  3. Accelerated and industrial-scale computing to train these models

To be useful for self-driving, a foundation model of the physical world must:

  1. Capture the 3D geometry of the world

  2. Understand dynamics and be able to forecast future states of the world

  3. Generalize to rare and safety critical situations

  4. Be able to run in real-time to facilitate real-world applications

  5. Be transferable to a wide variety of downstream tasks like object detection and trajectory forecasting

LiDAR data provides explicit 3D geometric information and as such presents a promising avenue for building foundation models of the physical world. However, prior works like 4D-Occ struggle with forecasting and generalization, while more performant methods like our Copilot4D are too slow for real-time inference; neither demonstrates adaptability to downstream tasks. As LiDAR is the primary sensing modality for Level 4 self-driving vehicles, in this work we focus on building a LiDAR foundation model that meets the five criteria above.

Challenges

Modeling the physical world through LiDAR data brings challenges not encountered with language. First, while language is low-dimensional and naturally represented with discrete tokens, sensor data is both high-dimensional and continuous in space and time:

As such, we cannot apply the recipes from LLMs directly to our task. Second, while text generation is directly useful for a variety of applications, it is not immediately clear how to use a foundation model that predicts sensor data to enhance a self-driving system:

Idea

Instead, we propose to model the world with a spatio-temporal occupancy field, which specifies the probability that a particular location in space and future time is occupied.

  • Paralleling classification over language tokens, occupancy allows us to learn a classification task instead of regressing point clouds directly.

  • Occupancy abstracts away the specifics of the LiDAR sensors and material properties that are hard to learn.

  • Occupancy is employed by many algorithms in robotics and autonomous driving, making it directly useful to existing stacks.


Method

Unsupervised Task

In this section we discuss how to learn 4D occupancy fields from LiDAR data. Our unsupervised task leverages the occupancy information provided by future LiDAR data. We assume we know the instantaneous position of the LiDAR sensor, denoted \(\color{blue}s_i\) below.

Tracing a ray from the LiDAR position to any LiDAR point \(\color{magenta}p_{ij}\) tells us which parts of the scene must be unoccupied \(\color{red}\mathcal{R}_{ij}^{-}\), since otherwise the return would have occurred earlier.

We also assume that there must be a small region of occupied space \(\color{green}\mathcal{R}_{ij}^{+}\) immediately after the LiDAR point, which reflects the laser. Repeating this for many LiDAR rays allows us to describe the occupancy of the visible parts of the scene, and we can supervise forecasting because future sensor data is available in our datasets. We train the occupancy model with binary classification over these regions in space and future time, sampling query points equally from positive and negative regions during training. This task leverages the ability of neural networks to interpolate and generalize to unseen areas.
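
One simple way to parameterize these two regions along ray \(j\) of sweep \(i\) (our illustration; the exact margin and parameterization are hyperparameters of the method, not something we reproduce verbatim here) is:

\[
\mathcal{R}_{ij}^{-} = \{\, s_i + \alpha\,(p_{ij} - s_i) : \alpha \in [0, 1) \,\}, \qquad
\mathcal{R}_{ij}^{+} = \{\, p_{ij} + \beta\,\hat{d}_{ij} : \beta \in [0, \epsilon] \,\},
\]

where \(\hat{d}_{ij} = (p_{ij} - s_i) / \lVert p_{ij} - s_i \rVert\) is the ray direction and \(\epsilon\) is a small positive margin.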

The overall training procedure is pictured above: the model takes past LiDAR sweeps as input to forecast a continuous spatio-temporal occupancy field into the future, which is supervised using the information from future LiDAR sweeps.
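
As a concrete but purely illustrative sketch, the query sampling and binary cross-entropy objective could look as follows in PyTorch; the function names, the model(past_sweeps, queries) interface, and the margin eps are our assumptions rather than the released implementation:

import torch
import torch.nn.functional as F

def sample_ray_queries(sensor_origin, points, times, n_per_ray=4, eps=0.1):
    """Sample balanced free/occupied query points (x, y, z, t) from one future sweep.

    sensor_origin: (3,) LiDAR position s_i for this sweep.
    points:        (N, 3) LiDAR points p_ij.
    times:         (N,) capture time of each point (each ray fires at a different time).
    """
    dirs = points - sensor_origin                      # ray vectors from s_i to p_ij
    unit = dirs / dirs.norm(dim=-1, keepdim=True)      # unit ray directions

    # Free space: anywhere strictly between the sensor and the observed return.
    alpha = torch.rand(points.shape[0], n_per_ray, 1)
    free_xyz = sensor_origin + alpha * dirs[:, None, :]

    # Occupied space: a small segment of length eps just past the return.
    beta = torch.rand(points.shape[0], n_per_ray, 1) * eps
    occ_xyz = points[:, None, :] + beta * unit[:, None, :]

    t = times[:, None, None].expand(-1, n_per_ray, -1)
    free_q = torch.cat([free_xyz, t], dim=-1).reshape(-1, 4)
    occ_q = torch.cat([occ_xyz, t], dim=-1).reshape(-1, 4)

    queries = torch.cat([free_q, occ_q], dim=0)
    labels = torch.cat([torch.zeros(len(free_q)), torch.ones(len(occ_q))])
    return queries, labels                             # equal numbers of + and - samples

def occupancy_loss(model, past_sweeps, sensor_origin, future_points, future_times):
    queries, labels = sample_ray_queries(sensor_origin, future_points, future_times)
    logits = model(past_sweeps, queries)               # one occupancy logit per query
    return F.binary_cross_entropy_with_logits(logits, labels)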

Architecture

We need an architecture that:

  1. Can output an occupancy probability at any continuous point in space and time. This is because each LiDAR ray is produced at a different continuous time.

  2. Can efficiently represent large 4D scenes

  3. Has a large receptive field to capture actor dynamics across large spatial areas, such as fast-moving vehicles on a highway.

This architecture is based on our prior work ImplicitO, and we refer the interested reader there for further intuition and details. At a high level, we voxelize the past LiDAR sweeps and pass them through a LiDAR encoder to produce a bird's-eye-view feature map Z, which encodes geometric, dynamic, and semantic features of the input data. This feature map, along with a batch of query points, is fed into the implicit decoder. The decoder uses a deformable attention mechanism to increase its receptive field, which aids in learning dynamics, and it runs on all query points in parallel for efficiency.
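
The sketch below captures this flow. It is a simplified stand-in for the real decoder: it gathers BEV features at a set of learned offsets around each query rather than implementing full deformable attention, and all module names, shapes, and sizes are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitOccupancyDecoder(nn.Module):
    """Decodes occupancy logits from a BEV feature map Z at continuous (x, y, z, t) queries."""

    def __init__(self, feat_dim=128, n_offsets=8, hidden=256):
        super().__init__()
        self.offsets = nn.Linear(feat_dim + 4, n_offsets * 2)    # per-query BEV offsets
        self.head = nn.Sequential(
            nn.Linear((n_offsets + 1) * feat_dim + 4, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                                 # occupancy logit
        )

    def sample(self, Z, xy):
        # Z: (1, C, H, W) BEV features from the LiDAR encoder; xy: (Q, 2) in [-1, 1].
        grid = xy.view(1, -1, 1, 2)
        feats = F.grid_sample(Z, grid, align_corners=False)       # (1, C, Q, 1)
        return feats[0, :, :, 0].t()                              # (Q, C)

    def forward(self, Z, queries):
        # queries: (Q, 4) continuous (x, y, z, t) points with xy already normalized.
        xy = queries[:, :2]
        center = self.sample(Z, xy)                               # features at each query
        offs = self.offsets(torch.cat([center, queries], dim=-1))
        offs = offs.view(len(queries), -1, 2)                     # (Q, K, 2) learned offsets
        gathered = [self.sample(Z, xy + offs[:, k]) for k in range(offs.shape[1])]
        feats = torch.cat([center, *gathered, queries], dim=-1)
        return self.head(feats).squeeze(-1)                       # (Q,) occupancy logits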


Experiments

Geometric Occupancy

Below we visualize UnO’s occupancy at the present time, unrolled across an entire sequence from the urban driving dataset Argoverse 2. We show two views: the perspective view colours occupancy by height (z) in the ego frame, while the first-person view colours occupancy by depth. The camera images are provided for reference and are not used as input.


We observe that UnO captures all objects in the scene, including vulnerable road users like the cyclist. In a highway driving scenario collected from one of our trucks, UnO is able to perceive a couch on the highway, an object it never saw during training: an impressive example of generalization.

Together, these examples show that UnO generalizes across:

  • different vehicle platforms with different sensor setups (a passenger car in the Argoverse dataset, an autonomous truck in the highway dataset)

  • different driving domains: urban vs. highway

Below we visualize 4D occupancy forecasts of UnO and 4D-Occ. Unlike previous visualizations, here we show what UnO forecasts into the future. The ground truth future point cloud is provided for reference.


We see that UnO captures dynamic actors like the turning vehicle, while 4D-Occ struggles to generalize from the input point clouds to 4D occupancy and predicts all actors to be static. In a second example, UnO accurately predicts the lane change of a vehicle.

To validate our unsupervised task, we compare UnO's pre-training against several alternative occupancy-based pre-training procedures:

  • Depth rendering: given a LiDAR ray, the model predicts occupancy values along that ray, and those occupancy values are trained with a NeRF-like depth loss against the ground-truth LiDAR point depth.

  • Free-space rendering: given a LiDAR ray, the model predicts occupancy values along that ray; we take the cumulative maximum of those occupancy values and supervise it with binary cross-entropy against the ground-truth visibility implied by the LiDAR return (a rough sketch follows this list).

  • Unbalanced UnO: we use the same objective as UnO, but do not balance the positive and negative query points used during training.
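
For concreteness, the free-space rendering loss referenced above could be computed as in the following sketch; the sampling layout and names are our assumptions, intended only to illustrate the cumulative-maximum idea:

import torch
import torch.nn.functional as F

def free_space_rendering_loss(occ_along_ray, hit_index):
    """occ_along_ray: (R, S) occupancy probabilities at S ordered samples per ray.
    hit_index:      (R,) index of the sample closest to the observed LiDAR return.

    Samples before the return are visible free space, so their cumulative maximum
    occupancy should stay near 0; at and beyond the return it should reach 1.
    """
    cum_occ, _ = torch.cummax(occ_along_ray, dim=-1)            # (R, S)
    sample_idx = torch.arange(occ_along_ray.shape[-1])          # (S,)
    visible_free = (sample_idx[None, :] < hit_index[:, None]).float()
    target = 1.0 - visible_free                                 # 0 before the hit, 1 after
    return F.binary_cross_entropy(cum_occ.clamp(1e-6, 1 - 1e-6), target)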

UnO outperforms these alternative occupancy-based pre-training procedures, achieving much higher geometric occupancy recall across all actor classes.

This is supported by quantitative and qualitative comparisons, which show that:

  • Depth rendering hallucinates some objects, struggles with object extent, and does not capture the motion of vehicles

  • Free-space rendering has poor motion predictions

  • Unbalanced UnO is under-confident, even on static background areas, and suffers from disappearing occupancy for moving actors.

LiDAR Point Cloud Forecasting

Point cloud forecasting asks a model to predict future LiDAR sweeps directly, which entangles world modeling with several sensor-specific nuisances:

  1. The model needs to either predict or encode the future sensor location in order to perform raytracing.

  2. The model needs to memorize the intrinsics specific to the LiDAR sensor.

  3. The model needs to understand the reflectance and other material properties of objects in the scene.

Instead of forecasting point clouds directly, we transfer UnO to this task by adding a lightweight learned renderer. The renderer takes a query ray as input and queries UnO for occupancy values along that ray. These occupancy values, along with positional encodings, are fed into a small MLP that regresses the depth of the LiDAR return along the ray, supervised with an \(\ell_1\) loss against the ground-truth depth. By querying this learned renderer with rays defined by the known LiDAR sensor intrinsics, we can forecast full point clouds.
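
A minimal sketch of what such a learned renderer could look like, reusing the hypothetical occupancy_field(Z, queries) decoder interface from the architecture sketch above; the number of ray samples, the simple normalized-depth positional encoding, and the MLP sizes are illustrative assumptions rather than the actual implementation:

import torch
import torch.nn as nn

class LearnedDepthRenderer(nn.Module):
    """Regresses the depth of a LiDAR return along a query ray from occupancy values."""

    def __init__(self, n_samples=64, max_depth=80.0, hidden=128):
        super().__init__()
        self.n_samples = n_samples
        self.max_depth = max_depth
        # Input: one occupancy value plus one depth encoding per ray sample.
        self.mlp = nn.Sequential(
            nn.Linear(n_samples * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, occupancy_field, Z, origin, direction, time):
        # origin, direction: (3,) ray origin and unit direction; time: scalar tensor.
        # Query occupancy at fixed depths along the ray at the requested time.
        depths = torch.linspace(0.0, self.max_depth, self.n_samples)           # (S,)
        xyz = origin[None, :] + depths[:, None] * direction[None, :]           # (S, 3)
        t = time.expand(self.n_samples, 1)
        occ = torch.sigmoid(occupancy_field(Z, torch.cat([xyz, t], dim=-1)))   # (S,)
        feats = torch.cat([occ, depths / self.max_depth], dim=0)               # occupancy + crude depth encoding
        return self.mlp(feats) * self.max_depth                                # predicted return depth

def depth_loss(pred_depth, gt_depth):
    return (pred_depth - gt_depth).abs().mean()     # l1 loss against the ground-truth depth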

Quantitatively, UnO significantly outperforms contemporary point cloud forecasting methods across a diverse range of datasets, and it emerged as the best model in the CVPR 2024 Argoverse 2 LiDAR forecasting challenge.

Compared to the state of the art, UnO more accurately predicts point clouds on moving objects, like the turning vehicle in this example:


Semantic Bird’s Eye View Occupancy Forecasting

To transfer UnO to the task of semantic BEV occupancy forecasting, we start from UnO's pre-trained weights. Because the occupancy targets lie in the x-y plane, we replace the z coordinate of the query points with a learned value, and fine-tune the model to forecast semantic occupancy labels with a binary cross-entropy loss.
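
A rough sketch of this fine-tuning step, again reusing the hypothetical decoder interface from the architecture sketch; the learned z value and the binary cross-entropy loss follow the description above, while the names and shapes are our own:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBEVHead(nn.Module):
    """Fine-tunes a pre-trained UnO-style decoder for BEV semantic occupancy forecasting."""

    def __init__(self, pretrained_decoder):
        super().__init__()
        self.decoder = pretrained_decoder              # initialized from UnO's weights
        self.learned_z = nn.Parameter(torch.zeros(1))  # replaces z in the query points

    def forward(self, Z, xy_t):
        # xy_t: (Q, 3) BEV query locations and future times; z comes from the learned value.
        z = self.learned_z.expand(len(xy_t), 1)
        queries = torch.cat([xy_t[:, :2], z, xy_t[:, 2:]], dim=-1)   # (Q, 4) = (x, y, z, t)
        return self.decoder(Z, queries)                              # semantic occupancy logits

def finetune_loss(logits, labels):
    return F.binary_cross_entropy_with_logits(logits, labels)       # labels from BEV annotations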

In the graph below, we plot semantic occupancy forecasting performance as a function of the number of available labeled training examples. We find that UnO's unsupervised pre-training allows it to outperform the contemporary methods ImplicitO and MP3, which are trained only on labeled occupancy data, even when UnO is fine-tuned with an order of magnitude less labeled data. At all levels of supervision, UnO brings a significant boost in performance. This is because UnO's pre-training supervises all areas of the scene, including map topology and interactions with non-vehicle actors, which allows it to better forecast vehicle dynamics. Additionally, for each alternative pre-training objective discussed earlier, we fine-tuned the model using the same procedure as for UnO; UnO's pre-training provides the best downstream semantic occupancy forecasting performance.

Compared to contemporary approaches to BEV semantic occupancy forecasting, UnO is particularly good at forecasting occupancy for relatively rare occurrences, like the large articulated vehicle.


We note that ImplicitO uses the same architecture as UnO, but without the unsupervised pre-training. UnO's pre-training allows it to understand the road topology and how vehicles move with respect to it. While ImplicitO, our previous state-of-the-art BEV occupancy model, is uncertain about the intent of the turning actor, UnO correctly forecasts the turn.


Conclusion

To summarize, UnO is a 4D occupancy world model learned with self-supervision from LiDAR data that:

  • is adaptable to various downstream tasks, like LiDAR forecasting and semantic occupancy forecasting,

  • can generalize to new scenarios and vehicle platforms,

  • and has an efficient architecture capable of real-time inference on the edge, satisfying the criteria we laid out for a foundation model of the physical world.


BibTeX

@inproceedings{agro2024uno,
    title     = {UnO: Unsupervised Occupancy Fields for Perception and Forecasting},
    author    = {Agro, Ben and Sykora, Quin and Casas, Sergio and Gilles, Thomas and Urtasun, Raquel},
    booktitle = {CVPR},
    year      = {2024},
    }