MSc Thesis Defense: Doruk Benli, PROMPT-TO-POPULATION: CONTROLLABLE PIPELINE FOR COHERENT AND DIVERSE TEXT TO 3D ASSETS, Date & Time: June 16, 2026 – 10:00 AM, Place: FENS 2019

PROMPT-TO-POPULATION: CONTROLLABLE PIPELINE FOR COHERENT AND DIVERSE TEXT TO 3D ASSETS

Doruk Benli
Computer Science & Engineering, MSc Thesis, June 2026

Thesis Jury

Prof. Selim Balcısoy

Prof. Tolga Çapın

Asst. Prof. Polat Göktaş

Date & Time: June 16th, 2026 – 10:00 AM

Place: FENS 2019

Keywords: Crowd Generation, Text-to-3D Synthesis, Procedural Character Generation, Generative Models, Population-Level Semantic Modeling

Abstract

Crowd simulation is an essential technique for designing realistic virtual environments and for urban planning. However, populating scenes with high-fidelity, realistic, and coherent crowds is a labor-intensive and time-consuming task. In this thesis, we address the lack of fast, coherent, and editable crowd assets for simulations and propose a controllable text-to-3D crowd generation pipeline that eases scene population by generating semantically coherent and editable 3D assets. Our framework uses Large Language Models (LLMs) to expand general prompts into detailed visual descriptions. We then employ a specialized full-body diffusion model with pose conditioning to generate consistent character previews in T-pose, followed by a cascaded refinement stage that raises visual detail to high fidelity. These full-body T-pose images are then processed through an image-to-3D reconstruction pipeline to produce T-posed meshes directly compatible with standard auto-rigging pipelines such as Adobe Mixamo. We further introduce the Visual Crowd Coherence Score (VCCS), a population-level evaluation metric based on CLIP embeddings that balances intra-group similarity, distributional spread, and explicit mode-collapse penalization. Targeted ablation studies demonstrate that removing hierarchical trait pooling degrades VCCS by approximately 10%, while disabling cascaded refinement results in up to 35% degradation. We further show that independently LLM-generated descriptions without the Trait Pool still suffer from mode collapse, validating the importance of population-level semantic modeling. A perceptual study (N=30) finds that our method produces crowds perceived as significantly more diverse and coherent than non-structured baselines (Wilcoxon signed-rank, p<0.001).
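
The abstract names the three ingredients of VCCS but does not state its formula. The sketch below is therefore only an illustration of how a population-level score of this kind might combine them, not the definition used in the thesis. The function name `crowd_score`, the unweighted sum, and the `collapse_threshold` of 0.95 are all assumptions, and the inputs stand in for CLIP image embeddings that would be computed elsewhere.

```python
import math
from itertools import combinations


def _normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def crowd_score(embeddings, collapse_threshold=0.95):
    """Illustrative population-level score (hypothetical, not the thesis's VCCS).

    embeddings: at least two non-zero vectors, e.g. one CLIP embedding per character.
    collapse_threshold: assumed cosine similarity above which two characters
        are treated as near-duplicates.
    """
    unit = [_normalize(e) for e in embeddings]
    dim = len(unit[0])

    # Intra-group similarity: mean cosine similarity to the group centroid.
    centroid = _normalize([sum(u[d] for u in unit) / len(unit) for d in range(dim)])
    coherence = sum(_dot(u, centroid) for u in unit) / len(unit)

    # Distributional spread: mean pairwise cosine distance.
    pair_sims = [_dot(a, b) for a, b in combinations(unit, 2)]
    spread = sum(1.0 - s for s in pair_sims) / len(pair_sims)

    # Mode-collapse penalty: fraction of pairs that are near-duplicates.
    collapse = sum(s > collapse_threshold for s in pair_sims) / len(pair_sims)

    return {
        "coherence": coherence,
        "spread": spread,
        "collapse": collapse,
        "score": coherence + spread - collapse,  # assumed unweighted combination
    }


# Toy 3-D vectors standing in for real CLIP embeddings.
varied_crowd = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.8, 0.0, 0.6]]
collapsed_crowd = [[1.0, 0.0, 0.0]] * 3
```

Under these assumptions, a crowd of identical characters is fully coherent but has no spread and a maximal collapse penalty, so it scores lower than a crowd whose members are related yet distinct.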