DIOD (Self-Distillation Meets Object Discovery) boosts the performance of unsupervised object discovery in videos

#computer vision #responsible AI #Self-Distillation

**DIOD outperforms other state-of-the-art unsupervised object discovery methods. © CEA**

A fundamental task in computer vision is locating objects of interest in video footage. A major obstacle to developing AI in this field is the need for large amounts of annotated data to train models to perform the task well enough. Object discovery aims to locate objects without any human-annotated data. And, unlike conventional object detection systems, object discovery can handle unknown object classes.

When self-distillation meets object discovery

Low-level signals like motion or depth information can be used to discover objects in an image or in video, without resorting to manual annotation.

Our research focuses on motion-guided object discovery, which presents several technical challenges. First, by definition, the motion information used as a source of supervision does not target static objects, which makes it difficult to generalize to such objects. Second, camera movement generates noise that makes it hard to distinguish moving objects from background elements that appear to be in motion.

We looked to self-distillation, a concept as yet unexplored for object discovery, to address these challenges. Self-distillation relies on a teacher model, which automatically labels unannotated images, and a student model, which learns to solve the main task using data annotated manually or by the teacher model. The teacher-student setup makes it possible to learn from new unannotated data, and the quality of the pseudo-labels initially generated by the teacher model gradually improves.

DIOD is the first method to combine self-distillation with object discovery. With the teacher-student architecture, the teacher model can be updated dynamically based on what the student has learned. The student discovers objects from two sources: the teacher’s attention maps, which include a confidence rating to ensure only high-confidence objects are retained; and the motion masks, from which noisy segments have been removed. The pseudo-labels are gradually improved, increasing overall performance during training. This approach addresses both of the technical challenges mentioned above. Static objects (like parked cars) that the teacher model generalizes to can now be learned, and the noise caused by camera movement is significantly reduced by the filtering mechanisms applied.
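The two pseudo-label sources and the dynamic teacher update described above can be illustrated with a minimal sketch. It is not the DIOD implementation: the function names, the confidence threshold, the area-based rule for rejecting noisy motion segments, and the exponential-moving-average (EMA) teacher update are all assumptions chosen for illustration, since the article does not specify how filtering or the teacher update is done.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = pixel belongs to the object


def keep_confident_masks(masks: List[Mask], scores: List[float],
                         min_confidence: float = 0.9) -> List[Mask]:
    """Keep only teacher masks whose confidence reaches the threshold.

    The threshold value is a hypothetical choice for this sketch.
    """
    return [m for m, s in zip(masks, scores) if s >= min_confidence]


def mask_area_ratio(mask: Mask) -> float:
    """Fraction of the frame covered by the mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def drop_noisy_motion_masks(masks: List[Mask],
                            max_area_ratio: float = 0.5) -> List[Mask]:
    """Discard motion segments that cover most of the frame.

    Assumption: very large segments are treated as camera-motion noise.
    This is a stand-in rule, not the filtering used by DIOD.
    """
    return [m for m in masks if mask_area_ratio(m) <= max_area_ratio]


def ema_update(teacher: List[float], student: List[float],
               momentum: float = 0.99) -> List[float]:
    """Move teacher weights toward the student's (EMA update, assumed here)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


# Toy 2x2 frames: one confident and one uncertain teacher mask,
# one full-frame motion segment and one small motion segment.
teacher_masks = [[[1, 0], [0, 0]], [[0, 1], [0, 0]]]
teacher_scores = [0.95, 0.40]
motion_masks = [[[1, 1], [1, 1]], [[0, 0], [0, 1]]]

pseudo_labels = (keep_confident_masks(teacher_masks, teacher_scores)
                 + drop_noisy_motion_masks(motion_masks))
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
```

In this toy run, `pseudo_labels` holds the confident teacher mask and the small motion segment, and these are the targets the student would train on. After each training step the teacher moves slightly toward the student, so later pseudo-labels reflect what the student has learned.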

DIOD outperforms state-of-the-art methods by a significant margin (+18.8 points fg-ARI, +43.8 points all-ARI, +8.9 points F1@score on the KITTI dataset). It is more effective at discovering both moving and static objects, eliminating background noise, and distinguishing adjacent objects of the same semantic class.
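For readers unfamiliar with the ARI-based scores quoted above, the sketch below computes the Adjusted Rand Index between ground-truth and predicted per-pixel segment ids. It assumes the usual convention in the object discovery literature: all-ARI is computed over every pixel, while fg-ARI ignores pixels labelled as background in the ground truth. The function names and the choice of `0` as the background id are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Sequence


def adjusted_rand_index(true_ids: Sequence[int], pred_ids: Sequence[int]) -> float:
    """Adjusted Rand Index between two labelings of the same pixels."""
    total_pairs = comb(len(true_ids), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of pixels grouped together in both, in truth only, in prediction only.
    sum_both = sum(comb(c, 2) for c in Counter(zip(true_ids, pred_ids)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_ids).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_ids).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial partitions
    return (sum_both - expected) / (max_index - expected)


def fg_ari(true_ids: Sequence[int], pred_ids: Sequence[int],
           background_id: int = 0) -> float:
    """ARI restricted to pixels that are foreground in the ground truth."""
    kept = [(t, p) for t, p in zip(true_ids, pred_ids) if t != background_id]
    if not kept:
        return 1.0
    fg_true, fg_pred = zip(*kept)
    return adjusted_rand_index(fg_true, fg_pred)


# Flattened toy image: two background pixels (id 0) and two objects.
# The prediction segments both objects correctly but splits the background.
truth = [0, 0, 1, 1, 2, 2]
prediction = [5, 7, 3, 3, 4, 4]

foreground_score = fg_ari(truth, prediction)           # objects only
all_score = adjusted_rand_index(truth, prediction)     # every pixel
```

Here the foreground score is perfect while the all-pixel score is lower, because the split background counts only in the second metric. This is why the two figures can differ widely for the same method.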

These capabilities make DIOD a high-performance object discovery method that requires no manual annotation. DIOD could be used to automate annotation, either reducing its cost or eliminating it entirely. It could also be applied to 3D point clouds from LIDAR data, which is potentially highly valuable for automated driving applications, and for the discovery of 2D or 3D objects using a multi-modal model that combines the strengths of 2D RGB images and 3D LIDAR data.

**Comparison of object discovery predictions of three state-of-the-art methods on a TRI-PD dataset image © CEA**

Key features

Discover moving objects without human annotation
Smart pre-annotation of moving objects

Patent DD24102 CJ

Flagship publication

“DIOD: Self-Distillation Meets Object Discovery.”
Kara, S., Ammar, H., Denize, J., Chabot, F., and Pham, Q. C. (2024).
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (rank A*).
