Abstract
This study proposes a Human-AI interaction retrieval framework that combines natural language descriptions with 3D structural forms, using 3D fold forms as the example case due to their rich geometric and mathematical properties. We constructed a database covering a variety of 3D fold forms, annotated with style tags and body/garment position tags to assist training. The framework adopts a contrastive learning strategy: text input is processed by the pre-trained CLIP (Contrastive Language-Image Pre-training) text encoder, and we present a geometric encoder that extracts vertex information and point cloud data from 3D fold forms to obtain geometric feature embeddings. The text features and geometric features are then mapped into a joint embedding space, and cross-modal alignment is trained with an InfoNCE loss. After training, the framework uses FAISS to construct a similarity index over the geometric vectors, allowing users to query, in real time, the 3D fold forms closest in semantic distance to a descriptive-language prompt. Experiments show that the framework achieves satisfactory retrieval accuracy and can retrieve geometrically matching fold forms from semantic descriptions alone. This study highlights the potential of AI in connecting design semantics with geometric structures, and provides an intelligent tool for AI-assisted design.
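The training objective and retrieval step summarized above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: `info_nce_loss` computes a symmetric InfoNCE loss over a batch of paired text/geometry embeddings, and `retrieve` performs cosine-similarity search as a brute-force stand-in for a FAISS inner-product index (in practice, `faiss.IndexFlatIP` over L2-normalized vectors). All function names, the temperature value, and the embedding dimensions are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(text_emb, geom_emb, temperature=0.07):
    """Symmetric InfoNCE loss; matching text/geometry pairs share a row index."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = geom_emb / np.linalg.norm(geom_emb, axis=1, keepdims=True)
    logits = (t @ g.T) / temperature  # (B, B) cosine-similarity logits

    def xent(l):
        # Cross-entropy with the positive pair on the diagonal of each row.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the text-to-geometry and geometry-to-text directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def retrieve(query_emb, index_emb, k=3):
    """Top-k neighbours by cosine similarity (brute-force FAISS stand-in)."""
    q = query_emb / np.linalg.norm(query_emb)
    x = index_emb / np.linalg.norm(index_emb, axis=1, keepdims=True)
    sims = x @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```

The diagonal of the logits matrix holds the matching text-geometry pairs, so minimizing this loss pulls each description toward its own fold form and pushes it away from the other forms in the batch; once the two encoders are aligned, retrieval reduces to a nearest-neighbour search over the pre-computed geometric vectors.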
Keywords
Artificial intelligence; Multimodal; Human-AI interaction; Retrieval system
DOI
https://doi.org/10.21606/iasdr.2025.603
Citation
Huang, T., and Gao, W. (2025) CLIP the Form: A Human-AI Interaction Framework for Retrieving 3D Structural Forms from Textual Prompts, in Chang, C.-Y., and Hsu, Y. (eds.), IASDR 2025: Design Next, 02-05 December, Taiwan. https://doi.org/10.21606/iasdr.2025.603
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
Conference Track
Track 4 - Human-Centered AI
CLIP the Form: A Human-AI Interaction Framework for Retrieving 3D Structural Forms from Textual Prompts