3D scene analysis using natural language queries

DiSCO-3D unifies unsupervised and open-vocabulary 3D semantic segmentation in order to discover semantic sub-concepts adapted to the content of the scene.
The DiSCO-3D semantic segmentation method discovers, in a 3D scene, the elements that correspond to the semantic sub-concepts of a user query expressed in natural language. Built on a NeRF*-based approach, the method delivers a high-level understanding of scenes for applications such as robotics or augmented reality.

Current 3D semantic segmentation methods follow one of two paradigms. They either identify the objects that match a single semantic concept requested by the user (open-vocabulary segmentation, or OV-Seg), or adapt to the content of the scene by discovering several semantic concepts on their own (unsupervised semantic segmentation, or USS).
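To make the OV-Seg idea concrete, here is a minimal sketch of one common way to do query-driven selection: compare a language-aligned feature vector at each 3D point with an embedding of the query, and keep the points whose similarity is high enough. The toy 2D vectors, the function names and the 0.7 threshold are all illustrative assumptions and are not taken from DiSCO-3D; real systems would use high-dimensional embeddings from a vision-language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def open_vocab_mask(point_features, query_embedding, threshold=0.7):
    """Return True for each point whose feature is close to the query embedding.

    The threshold is a hypothetical value chosen for this toy example.
    """
    return [
        cosine_similarity(feature, query_embedding) >= threshold
        for feature in point_features
    ]


# Toy data: three points with 2D features, and a query embedding.
point_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query_embedding = [1.0, 0.0]

mask = open_vocab_mask(point_features, query_embedding)
print(mask)  # [True, True, False]
```

A USS method, by contrast, would group the same features into several clusters without using any query.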

DiSCO-3D brings these two paradigms together for the first time to address the broader problem of open-vocabulary sub-concept discovery (OV-SD). The objective is to discover the different semantic sub-concepts in the 3D scene that are relevant to a natural language query (Figure 1).

Figure 1: Segmentation of the sub-concepts discovered for the “sleep” (top image) and “furniture” (bottom image) queries.

DiSCO-3D’s architecture (Figure 2) is made up of two modules. The first module performs the OV-Seg task to identify the areas of the scene that do not correspond to the user’s query (the background). The second module performs the USS while “forcing” one of its segments to overlap the background identified by the first module. This supervision ensures that the other segments discovered by the USS correspond to semantic sub-concepts relevant to the query.
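The coupling between the two modules can be illustrated with a small sketch. It assumes that the USS module outputs one score (logit) per segment for every point, that segment 0 is the one reserved for the background, and that the supervision is a binary cross-entropy between the probability of segment 0 and the background mask produced by the OV-Seg module. These choices, and all names below, are assumptions made for illustration; the article does not specify the actual loss used by DiSCO-3D.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def background_supervision_loss(segment_logits, background_mask, eps=1e-9):
    """Mean binary cross-entropy between P(segment 0) and the OV-Seg background mask.

    segment_logits: one list of per-segment scores for each point (USS output).
    background_mask: one bool per point, True where OV-Seg found no match
                     with the query. Segment 0 as background is an assumption.
    """
    loss = 0.0
    for logits, is_background in zip(segment_logits, background_mask):
        p_background = softmax(logits)[0]
        target = 1.0 if is_background else 0.0
        loss -= target * math.log(p_background + eps)
        loss -= (1.0 - target) * math.log(1.0 - p_background + eps)
    return loss / len(segment_logits)


def assign_segments(segment_logits):
    """Hard assignment: index of the highest-scoring segment for each point."""
    return [max(range(len(l)), key=l.__getitem__) for l in segment_logits]


# Toy example: three points, three segments (0 = background).
background_mask = [True, False, False]
agreeing = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
conflicting = [[0.0, 5.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]

loss_agreeing = background_supervision_loss(agreeing, background_mask)
loss_conflicting = background_supervision_loss(conflicting, background_mask)
```

In this toy setup the loss is lower when segment 0 coincides with the background mask, so minimizing it pushes the remaining segments (1 and 2 here) onto the query-relevant part of the scene.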


Figure 2: DiSCO-3D applied to a LeRF** feature field

The method proved effective on a variety of user queries on different scenes (Figure 3).


Figure 3: Qualitative evaluation of DiSCO-3D for different queries.

Finally, because the queries are expressed in natural language, DiSCO-3D is easy to integrate as an agentic AI tool. This would make 3D scene analysis using large language models (LLMs) possible.
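As a rough illustration of what such an integration could look like, the sketch below describes a segmentation service as a tool with a JSON-style parameter schema and routes an LLM tool call to it. Everything here is hypothetical: the tool name `discover_subconcepts`, the `scene_id` and `query` parameters, and the `segment_fn` backend are placeholders, not a published DiSCO-3D interface.

```python
import json

# Hypothetical tool description, in the JSON-schema style that many LLM
# tool-calling interfaces accept. None of these names come from DiSCO-3D.
TOOL_SPEC = {
    "name": "discover_subconcepts",
    "description": (
        "Segment a 3D scene into the semantic sub-concepts relevant "
        "to a natural language query."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "scene_id": {"type": "string"},
            "query": {"type": "string"},
        },
        "required": ["scene_id", "query"],
    },
}


def handle_tool_call(arguments_json, segment_fn):
    """Check the arguments of a tool call and forward them to a backend.

    segment_fn is any callable (scene_id, query) -> list of sub-concept
    labels; it stands in for the actual segmentation method.
    """
    args = json.loads(arguments_json)
    missing = [k for k in TOOL_SPEC["parameters"]["required"] if k not in args]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    labels = segment_fn(args["scene_id"], args["query"])
    return json.dumps({"query": args["query"], "subconcepts": labels})
```

The backend is passed in as a parameter so that the wrapper makes no assumption about how the segmentation itself is run.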

*NeRF: Neural Radiance Fields are a state-of-the-art technology that uses a neural network to reconstruct 3D scenes from 2D images.
**LeRF: Language Embedded Radiance Fields extend the capabilities of NeRFs by associating semantic information with each point in a scene.

Learn more

Applications

  • Reality capture, autonomous robotics.

Patent

  • «Méthode de découverte automatique de sous-concepts sémantiques dans une scène» (“Method for the automatic discovery of semantic sub-concepts in a scene”), D. Petit, S. Bourgeois, V. Gay-Bellile, F. Chabot. French patent no. FR2501675.

Flagship publication

  • CEA-List’s FactoryIA supercomputer, supported by the Île-de-France Regional Council, made this research possible.

Contributors to this article

  • Doriand Petit, PhD student, CEA-List
  • Steve Bourgeois, Research Engineer, CEA-List