(a) SNAP Overview
SNAP encodes the point cloud and the prompts separately, then uses a Mask Decoder (a Prompt-Point Attention Module followed by Prediction Heads) to generate segmentation masks. Text prompts are handled by matching their CLIP embeddings against the predicted CLIP embeddings for semantic classification.
(b) Domain Normalization
Domain Normalization groups datasets into broader domains with similar statistical properties. This lets the model adapt to different data distributions while remaining applicable to new datasets, which only need to be assigned to their general domain.
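The grouping idea can be sketched as maintaining separate normalization statistics per domain and selecting them by domain id at inference time. This is a minimal illustration, not the paper's exact formulation; the class and method names are hypothetical.

```python
import numpy as np

class DomainNorm:
    """Per-domain feature normalization (illustrative sketch only).
    Each domain keeps its own mean/std estimated from its datasets;
    a new dataset reuses the statistics of its closest domain."""

    def __init__(self, num_domains, dim, eps=1e-5):
        self.mean = np.zeros((num_domains, dim))
        self.std = np.ones((num_domains, dim))
        self.eps = eps

    def fit_domain(self, domain_id, feats):
        # feats: (N, dim) features pooled over one domain's datasets
        self.mean[domain_id] = feats.mean(axis=0)
        self.std[domain_id] = feats.std(axis=0)

    def __call__(self, feats, domain_id):
        # Normalize with the statistics of the matching domain
        return (feats - self.mean[domain_id]) / (self.std[domain_id] + self.eps)
```

Keeping one statistics bank per domain, rather than per dataset, is what gives the flexibility described above: an unseen dataset does not require new statistics, only a domain assignment.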
(c) Prompt-Point Attention Module
The Prompt-Point Attention Module of the Mask Decoder is a series of attention layers that iteratively refine both the prompt and point-cloud embeddings: contextual information from the point cloud is first incorporated into the prompt embeddings, and the refined prompts are then used to condition the point-cloud embeddings.
(d) Text Encoder & Prediction Heads
The refined prompt embeddings are then passed through three lightweight prediction heads that predict the mask, a confidence score, and a CLIP embedding, respectively. An external CLIP Text Encoder processes the text prompts; the resulting text embeddings are matched with the predicted CLIP embeddings to produce semantic predictions.
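The three heads and the text-matching step can be sketched as follows. All weight matrices and function names here are illustrative assumptions, not the paper's actual parameterization; the mask head follows the common pattern of dotting a predicted mask embedding with per-point features, and text matching uses cosine similarity.

```python
import numpy as np

def l2norm(x):
    # Unit-normalize along the last axis (for cosine similarity)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def predict(prompt_emb, point_feats, W_mask, W_score, W_clip, text_embs):
    """Sketch of the three prediction heads plus CLIP text matching."""
    # Mask head: per-point logits via dot product with a mask embedding
    mask_emb = prompt_emb @ W_mask
    mask_logits = point_feats @ mask_emb          # shape (N,)
    # Confidence head: scalar quality score for the predicted mask
    score = float(prompt_emb @ W_score)
    # CLIP head: embedding matched against CLIP text embeddings
    clip_pred = l2norm(prompt_emb @ W_clip)
    sims = l2norm(text_embs) @ clip_pred          # cosine similarities
    label = int(np.argmax(sims))
    return mask_logits, score, label
```

Because the text encoder is external and only its embeddings are compared, the vocabulary of class names can be changed at inference time without retraining the heads.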