SNAP: Towards Segmenting Anything in Any Point Cloud

¹Northeastern University, ²The MathWorks, Inc.
*Indicates Equal Contribution

[Video: SNAP demonstration]

Abstract

Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability.

To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation.

Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation.

Method Overview


(a) SNAP Overview

SNAP encodes point clouds and prompts separately, then uses a Mask Decoder (Prompt-Point Attention Module + Prediction Heads) to generate segmentation masks. Text prompts are handled by matching CLIP embeddings with predicted mask embeddings for semantic classification.
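
The sketch below illustrates this two-stream flow in PyTorch. Every module name, layer choice, and tensor shape is an illustrative assumption for exposition, not the released SNAP implementation.

```python
# Minimal sketch of the encode-separately-then-decode design described
# above. All names and shapes are assumptions, not SNAP's actual code.
import torch
import torch.nn as nn

class SNAPSketch(nn.Module):
    def __init__(self, d_model: int = 256, clip_dim: int = 512):
        super().__init__()
        self.point_encoder = nn.Linear(6, d_model)     # stand-in for a 3D backbone (xyz + rgb)
        self.prompt_encoder = nn.Linear(3, d_model)    # stand-in: embeds 3D click coordinates
        self.decoder = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.mask_head = nn.Linear(d_model, d_model)   # mask embedding, dotted with point features
        self.score_head = nn.Linear(d_model, 1)        # confidence score per prompt
        self.clip_head = nn.Linear(d_model, clip_dim)  # embedding matched against CLIP text

    def forward(self, points: torch.Tensor, clicks: torch.Tensor):
        # points: (B, N, 6) point cloud; clicks: (B, Q, 3) spatial prompts
        feats = self.point_encoder(points)             # (B, N, D) point embeddings
        queries = self.prompt_encoder(clicks)          # (B, Q, D) prompt embeddings
        queries = self.decoder(queries, feats)         # simplified prompt-point attention
        mask_logits = self.mask_head(queries) @ feats.transpose(1, 2)  # (B, Q, N)
        return mask_logits, self.score_head(queries), self.clip_head(queries)
```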

(b) Domain Normalization

Domain Normalization groups datasets into broader domains with similar statistical properties, allowing the model to adapt effectively to different data distributions while retaining the flexibility to handle new datasets simply by identifying their general domain.
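
One common way to realize such domain-adaptive normalization, and the assumption behind the sketch below, is to keep a separate normalization branch per broad domain and route each batch by a domain id; SNAP's exact layer may differ.

```python
# Hedged sketch of domain normalization: one BatchNorm branch per broad
# domain, so statistics from one domain never corrupt another's
# (the negative-transfer problem). Design details are assumptions.
import torch
import torch.nn as nn

class DomainNorm(nn.Module):
    def __init__(self, num_features: int, domains=("indoor", "outdoor", "aerial")):
        super().__init__()
        self.domain_index = {name: i for i, name in enumerate(domains)}
        self.norms = nn.ModuleList(nn.BatchNorm1d(num_features) for _ in domains)

    def forward(self, x: torch.Tensor, domain: str) -> torch.Tensor:
        # x: (B, C, N) per-point features; the whole batch is assumed to
        # come from a single domain, selected by name.
        return self.norms[self.domain_index[domain]](x)

# A new dataset is served by mapping it to its closest general domain,
# e.g. norm(features, domain="outdoor") for a new driving dataset.
```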

(c) Prompt-Point Attention Module

The Prompt-Point Attention Module of the Mask Decoder is a series of attention layers that iteratively refines both the prompt and point cloud embeddings. The process is designed to first incorporate contextual information from the point cloud into the prompt embeddings, and then use the refined prompts to condition the point cloud embeddings.
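
A minimal sketch of this alternating refinement is given below, assuming standard multi-head attention with residual connections; the number of rounds, the prompt self-attention step, and the exact layer ordering are assumptions.

```python
# Sketch of the two-phase refinement described above: (1) prompts gather
# context from the point cloud, (2) points are conditioned on the refined
# prompts. Layer counts and ordering are illustrative assumptions.
import torch.nn as nn

class PromptPointAttention(nn.Module):
    def __init__(self, d_model: int = 256, nhead: int = 8, num_rounds: int = 2):
        super().__init__()
        self.num_rounds = num_rounds
        self.prompt_self = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.prompt_to_point = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.point_to_prompt = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, prompts, points):
        # prompts: (B, Q, D) prompt embeddings; points: (B, N, D) point embeddings
        for _ in range(self.num_rounds):
            # Phase 1: pull point-cloud context into the prompt embeddings.
            prompts = prompts + self.prompt_to_point(prompts, points, points)[0]
            prompts = prompts + self.prompt_self(prompts, prompts, prompts)[0]
            # Phase 2: condition the point embeddings on the refined prompts.
            points = points + self.point_to_prompt(points, prompts, prompts)[0]
        return prompts, points
```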

(d) Text Encoder & Prediction Heads

The refined prompt embeddings are then passed through three lightweight prediction heads that produce a segmentation mask, a confidence score, and a CLIP embedding. An external CLIP text encoder processes text prompts, which are then matched against the predicted CLIP embeddings to yield semantic predictions.
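
The matching step itself reduces to cosine similarity in CLIP space. The sketch below uses OpenAI's open-source `clip` package; the CLIP variant, the prompt template, and the `classify_masks` helper are hypothetical conveniences for illustration, not part of SNAP.

```python
# Hedged sketch of text-to-mask matching via CLIP cosine similarity.
# Requires: pip install git+https://github.com/openai/CLIP.git
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # external text encoder

@torch.no_grad()
def classify_masks(mask_clip_embeds: torch.Tensor, class_names: list[str]) -> torch.Tensor:
    # mask_clip_embeds: (M, 512) CLIP-space embeddings from the prediction
    # head, assumed to already live on `device`.
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_embeds = model.encode_text(tokens).float()      # (C, 512)
    sims = F.normalize(mask_clip_embeds.float(), dim=-1) @ \
           F.normalize(text_embeds, dim=-1).T            # (M, C) cosine similarities
    return sims.argmax(dim=-1)                           # best-matching class per mask
```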

Qualitative Results

[Figure gallery: side-by-side ground truth, SNAP prediction, and object selection views, with two examples each from indoor, outdoor, and aerial scenes.]

BibTeX
