
Annotation-free world discovery: With xMOD, 2D and 3D vision work together

Figure 1: Instance masks predicted by xMOD (2D & 3D)
One of the major challenges facing computer vision is the discovery and localization of 3D objects without human annotation. xMOD combines 2D vision (cameras) and 3D vision (LiDAR sensors)
in a novel cross-distillation method. The AI learns to segment its environment from motion cues in the images, delivering beyond-state-of-the-art performance.

Conventional object detection requires massive, manually annotated datasets, making the process costly and difficult to update or scale. We are working on an object-discovery approach that would eliminate the need for human supervision. In 2D image analysis, a group of pixels moving along a coherent trajectory is considered a single object. So far, this approach has not been explored extensively in 3D, mainly due to the sparsity inherent to LiDAR point clouds.

Using 2D motion, CEA-List developed an initial reference dataset for 3D object discovery (DIOD-3D), and then designed a cross-distillation training framework, xMOD, based on a symmetrical “teacher-student” architecture in which a 2D expert model and the 3D model guide each other’s learning. This reciprocal interaction harnesses the complementarity of the 2D and 3D data, compensating for each modality’s weaknesses. The camera steps in for the LiDAR sensor if there are too few data points, and the LiDAR sensor steps in for the camera in challenging conditions like low light or a lack of texture.
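The reciprocal teacher-student interaction described above can be illustrated with a minimal sketch. The function names, the plain mean-squared-error loss, and the EMA momentum value are illustrative assumptions, not the published xMOD training recipe; the sketch only shows the key idea that each modality's student is supervised by the other modality's teacher.

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.99):
    """Teacher weights slowly track the student via an exponential
    moving average, a common choice in teacher-student frameworks."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def cross_distillation_step(masks_2d_teacher, masks_3d_teacher,
                            masks_2d_student, masks_3d_student):
    """Cross-modal distillation: the 2D student is supervised by the
    3D teacher's pseudo-masks, and the 3D student by the 2D teacher's.
    A simple MSE stands in for the actual segmentation loss."""
    loss_2d = np.mean((masks_2d_student - masks_3d_teacher) ** 2)
    loss_3d = np.mean((masks_3d_student - masks_2d_teacher) ** 2)
    return loss_2d + loss_3d
```

In this setup, when one modality produces poor pseudo-masks (sparse LiDAR returns, dark images), the other modality's teacher still provides a usable training signal, which is the complementarity the article describes.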

The approach also includes another innovation: scene completion as a pretext task for 3D discovery. The model differs from simple reconstruction in that it is trained to predict denser point clouds for a better understanding of object geometries, even when input data is sparse.
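A scene-completion objective of this kind can be sketched as follows. The Chamfer distance used here is a standard point-cloud similarity measure and an assumption on my part; the source does not specify the loss, and `densify_fn` is a hypothetical stand-in for the model's completion head.

```python
import numpy as np

def chamfer_distance(pred_points, dense_points):
    """Symmetric Chamfer distance between a predicted (completed) cloud
    and a denser target cloud, both of shape (N, 3)."""
    d = np.linalg.norm(pred_points[:, None, :] - dense_points[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def scene_completion_loss(sparse_input, densify_fn, dense_target):
    """Pretext task: the model densifies a sparse LiDAR scan, and
    supervision comes from a denser point cloud rather than from
    human labels."""
    return chamfer_distance(densify_fn(sparse_input), dense_target)
```

Because the target is a denser version of the same scene, the model is pushed to infer full object geometries from partial returns, rather than merely reconstructing the sparse input it was given.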

With xMOD, millions of scenes can be labeled with no human intervention. The use of both camera and LiDAR data enhances precision.

The model was tested on one synthetic dataset (TRI-PD) and two real datasets (KITTI, Waymo), outperforming the state of the art in 2D object discovery, with performance improvements ranging from 8.7 (KITTI) to 15.1 (Waymo) F1-score points. Further reliability improvements can be achieved with a late fusion strategy, which combines the 2D and 3D predictions only at inference time. This strategy proves particularly effective for objects at medium distances (10 to 30 meters).
These advances in object discovery pave the way for faster, more reliable design of the 3D perception models essential to automated driving systems (ADS).
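The late fusion strategy mentioned above can be sketched in a few lines. The per-pixel score averaging, the weight, and the threshold value are illustrative assumptions; the point is simply that the two branches are trained independently and only their predictions are merged at inference.

```python
import numpy as np

def late_fuse(scores_2d, scores_3d, w2d=0.5, threshold=0.5):
    """Combine per-pixel objectness scores from the 2D (camera) and
    3D (LiDAR) branches by weighted averaging, then threshold to
    obtain the fused binary instance mask."""
    fused = w2d * scores_2d + (1.0 - w2d) * scores_3d
    return fused >= threshold
```

Since fusion happens only at inference, either branch can still run alone when the other sensor is unavailable or unreliable.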

Figure 2: 3D visualization of xMOD predictions (2D & 3D)

Key figure

+15.1

15.1: maximum F1@50-score improvement obtained by xMOD over the state of the art in 2D object discovery.

Embedding AI processing into smart systems and devices.

Rebecca Cabean

Saad Lahlali

PhD student — CEA-List

Applications

  • While the algorithm presented here was developed with autonomous vehicles and mobile robotics in mind, it can be generalized to other systems requiring robust, scalable perception without massive manual labeling. The solution's redundancy (camera plus LiDAR data) makes it especially well-suited to robust object discovery in degraded conditions (low light for cameras; low-reflectivity objects for LiDAR).
