MSc Thesis Defense: Doruk Benli, PROMPT-TO-POPULATION: CONTROLLABLE PIPELINE FOR COHERENT AND DIVERSE TEXT TO 3D ASSETS, Date & Time: June 16, 2026 – 10:00 AM, Place: FENS 2019
PROMPT-TO-POPULATION: CONTROLLABLE PIPELINE FOR COHERENT AND DIVERSE TEXT TO 3D ASSETS
Doruk Benli
Computer Science & Engineering, MSc Thesis, June 2026
Thesis Jury
Prof. Selim Balcısoy
Prof. Tolga Çapın
Asst. Prof. Polat Göktaş
Date & Time: June 16th, 2026 – 10:00 AM
Place: FENS 2019
Keywords: Crowd Generation, Text-to-3D Synthesis, Procedural Character Generation, Generative Models, Population-Level Semantic Modeling
Abstract
Crowd simulation is an essential technique for designing realistic virtual environments and for urban planning. However, populating scenes with high-fidelity, realistic, and coherent crowds is a labor-intensive and time-consuming task. In this thesis, we address the difficulty of producing crowd assets that are fast to generate, coherent, and editable, and we propose a controllable text-to-3D crowd generation pipeline that eases scene population by generating semantically coherent and editable 3D assets. Our framework uses Large Language Models (LLMs) to expand general prompts into detailed visual descriptions. We then employ a specialized full-body diffusion model with pose conditioning to generate consistent character previews in T-pose, followed by a cascaded refinement stage that raises visual detail to high fidelity. These full-body T-pose images are then processed through an image-to-3D reconstruction pipeline to produce T-posed meshes directly compatible with standard auto-rigging pipelines such as Adobe Mixamo. We further introduce the Visual Crowd Coherence Score (VCCS), a population-level evaluation metric based on CLIP embeddings that balances intra-group similarity, distributional spread, and explicit mode-collapse penalization. Targeted ablation studies demonstrate that removing hierarchical trait pooling degrades VCCS by approximately 10%, while disabling cascaded refinement results in up to 35% degradation. We also show that independently LLM-generated descriptions without the Trait Pool still suffer from mode collapse, validating the importance of population-level semantic modeling. A perceptual study (N=30) finds that our method produces crowds perceived as significantly more diverse and coherent than non-structured baselines (Wilcoxon signed-rank test, p<0.001).
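The abstract names the three ingredients of VCCS (intra-group similarity, distributional spread, and a mode-collapse penalty) but does not give its formula. The following is only an illustrative sketch of how a population-level score of that shape could be computed from precomputed image embeddings; it is not the thesis's definition. The function name `crowd_score_sketch`, the equal weights, the centroid-distance spread term, and the 0.98 near-duplicate threshold are all assumptions made for this example.

```python
import numpy as np


def crowd_score_sketch(embeddings, collapse_threshold=0.98, weights=(1.0, 1.0, 1.0)):
    """Toy population-level score (NOT the thesis's VCCS formula).

    embeddings: array of shape (n, d), one vector per character image,
    assumed to come from an image encoder such as CLIP.
    """
    x = np.asarray(embeddings, dtype=float)
    # L2-normalize so that dot products are cosine similarities.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    n = x.shape[0]

    # Cosine similarity of every unordered pair (upper triangle, no diagonal).
    sims = (x @ x.T)[np.triu_indices(n, k=1)]

    # Assumed term 1: intra-group similarity = mean pairwise cosine similarity.
    similarity = float(sims.mean())
    # Assumed term 2: spread = mean Euclidean distance to the group centroid.
    spread = float(np.linalg.norm(x - x.mean(axis=0), axis=1).mean())
    # Assumed term 3: collapse rate = fraction of near-duplicate pairs.
    collapse_rate = float((sims > collapse_threshold).mean())

    w_sim, w_spread, w_collapse = weights
    score = w_sim * similarity + w_spread * spread - w_collapse * collapse_rate
    return {
        "similarity": similarity,
        "spread": spread,
        "collapse_rate": collapse_rate,
        "score": score,
    }


# A fully collapsed crowd (identical embeddings) versus a related but varied one.
collapsed = crowd_score_sketch([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
varied = crowd_score_sketch([[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]])
print(collapsed["score"] < varied["score"])
```

In this toy setting, a crowd of identical characters has maximal similarity but is fully penalized for collapse, so it scores below a crowd whose members are related yet distinct. That is the qualitative behavior the abstract attributes to VCCS.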