Single-step generation hits a complexity ceiling.
As prompts become dense with entities, relations, and constraints, simply scaling the backbone gives diminishing returns.
CLVR turns complex text-to-image generation into an iterative visual reasoning loop: plan, generate, verify, correct, and stop once the image satisfies the prompt.
PRISM qualitative results, stitched left to right: panel A · panel B · panel C.
Key Takeaways
Scaling the backbone alone yields diminishing returns once prompts become dense with entities, relations, and constraints.
CLVR enables systematic task decomposition, breaking complex prompts into manageable sub-tasks to avoid the pitfalls of one-shot generation and raise the capacity ceiling, while iterative visual self-correction keeps each step within reliable boundaries.
PPRL distills messy multimodal histories into explicit step-level reward targets for more stable diffusion alignment.
DSWM reuses distillation priors instead of re-distilling costly reasoning trajectories, reducing each step to 4 NFEs.
Method
Each component addresses a specific bottleneck in reasoning-based image generation.
Data
A state-constrained controller generates visual reasoning trajectories, with step-level validation and global A/B judging against single-step baselines.
See data pipeline figure
Training
A teacher VLM converts long interleaved histories into explicit text and reference-image rewards, enabling step-wise diffusion optimization.
See framework overview
Inference
The diffusion model receives the evolving reasoning trace, preserving global constraints across multiple rounds of generation and editing.
See step-by-step inference figure
Deployment
Alignment and distillation increments are merged in weight space, keeping reasoning ability while inheriting fast 4-step decoding.
See DSWM efficiency table
The CLVR framework consists of three main components: (1) a training pipeline utilizing Supervised Fine-Tuning (SFT) and Proxy Prompt Reinforcement Learning (PPRL) to align the model with multi-step reasoning; (2) a closed-loop inference framework where the controller iteratively revisits the canvas; and (3) a Δ-Space Weight Merge algorithm that fuses alignment gains with distillation priors for efficient 4-step decoding.
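The Δ-Space Weight Merge can be pictured as adding two task-vector increments to a shared base checkpoint: the alignment delta and the distillation delta are computed relative to the base and summed in weight space. The sketch below is ours, not the paper's implementation; the function name `merge_deltas` and the scalar coefficients `alpha`/`beta` are illustrative assumptions.

```python
import numpy as np

def merge_deltas(base, aligned, distilled, alpha=1.0, beta=1.0):
    """Hypothetical Δ-space merge: combine the alignment increment
    (aligned - base) and the distillation increment (distilled - base)
    in weight space, on top of the shared base checkpoint."""
    merged = {}
    for name, w_base in base.items():
        d_align = aligned[name] - w_base      # reasoning-alignment delta
        d_distill = distilled[name] - w_base  # few-step distillation delta
        merged[name] = w_base + alpha * d_align + beta * d_distill
    return merged

# Toy two-parameter example: each fine-tune moved a different weight.
base = {"w": np.array([1.0, 2.0])}
aligned = {"w": np.array([1.5, 2.0])}    # alignment moved the first weight
distilled = {"w": np.array([1.0, 2.5])}  # distillation moved the second
out = merge_deltas(base, aligned, distilled)
print(out["w"])  # [1.5 2.5]
```

Because the two deltas touch largely different directions in weight space, their sum can retain both behaviors, which is the intuition behind merging without destructive interference.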
The data synthesis pipeline follows a Perceive-Reason-Act workflow orchestrated by a state-constrained agentic controller (Gemini 2.5 Pro). It generates high-fidelity reasoning trajectories through a dual-track verification mechanism: passive verification acts as a step-level gatekeeper to discard execution errors, while active verification serves as a global error-correction hub to detect and resolve semantic gaps between the canvas and the user prompt.
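The dual-track verification loop can be sketched as follows. This is a minimal schematic, not the paper's code: `generate`, `edit`, `passive_check`, and `active_check` are hypothetical callables standing in for the diffusion model, the editor, the step-level gatekeeper, and the global judge respectively.

```python
def run_trajectory(prompt, generate, edit, passive_check, active_check, max_steps=4):
    """Hypothetical sketch of the Perceive-Reason-Act loop with dual-track
    verification: active_check compares the whole canvas against the prompt
    and reports a semantic gap; passive_check gates each edit's execution."""
    canvas = generate(prompt)
    trajectory = [("generate", prompt)]
    for _ in range(max_steps):
        gap = active_check(canvas, prompt)        # global semantic gap, or None
        if gap is None:
            return canvas, trajectory             # prompt satisfied: stop early
        candidate = edit(canvas, gap)             # act on the detected gap
        if passive_check(candidate, gap):         # step-level gatekeeper
            canvas = candidate
            trajectory.append(("edit", gap))
        # else: discard the failed execution and retry next round
    return canvas, trajectory

# Toy instantiation: "images" are sets of satisfied constraints.
target = {"cat", "logo"}
img, traj = run_trajectory(
    "a cat with a logo",
    generate=lambda p: {"cat"},
    edit=lambda image, gap: image | {gap},
    passive_check=lambda image, gap: gap in image,
    active_check=lambda image, p: next(iter(target - image), None),
)
print(sorted(img))  # ['cat', 'logo']
```

Only trajectories whose every step passes the gatekeeper and whose final canvas closes all detected gaps would be kept as training data.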
A visualization of a verified reasoning trajectory synthesized by the data engine. In this example, the system first generates a base image, identifies a missing brand logo and adds it, and finally recognizes missing rear details to change the perspective to a rear three-quarter view. This demonstrates how the controller iteratively refines the image based on visual feedback until the final objective is met.
A CLVR inference case executing a complex prompt. The trajectory begins with concept initialization, followed by environment embedding, lighting and atmosphere refinement, and final typographic integration, resulting in a cohesive image that fulfills all complex requirements through accumulated reasoning steps.
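The key property of this inference loop is that each round conditions on the full evolving reasoning trace, not just the latest instruction, so earlier constraints stay visible to later edits. A minimal sketch, with `apply_step` as a hypothetical stand-in for one diffusion generation/edit round:

```python
def closed_loop_infer(steps, apply_step):
    """Hypothetical sketch of trace-conditioned inference: the evolving
    reasoning trace accumulates every instruction, and each round of
    generation or editing sees the whole trace."""
    trace, image = [], None
    for instruction in steps:
        trace.append(instruction)
        # The model is conditioned on the full trace, preserving
        # global constraints across multiple rounds.
        image = apply_step(image, list(trace))
    return image, trace

steps = [
    "initialize concept: vintage car",
    "embed environment: rain-slicked street",
    "refine lighting: warm dusk glow",
    "integrate typography: neon sign text",
]
# Toy apply_step: record how many constraints the final round saw.
image, trace = closed_loop_infer(steps, lambda img, tr: {"conditioned_on": len(tr)})
print(image)  # {'conditioned_on': 4}
```

Dropping the accumulated trace and conditioning only on the last instruction is exactly what lets attribute drift creep in over multiple edits.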
Qualitative comparisons between CLVR and strong baselines on challenging prompts. The results illustrate how single-step models often fail to satisfy all constraints (e.g., missed text, wrong object counts, or attribute drift), whereas CLVR's closed-loop approach approaches state-of-the-art proprietary instruction following by catching and correcting these failure modes.
Quantitative results of the semantic complexity probe diagnosing structural degradation in single-step generation. The probe stratifies prompts into ten complexity tiers Ctask. The left plot shows that CLVR achieves a higher Area Under the Pass-Complexity Curve (AUC Pass) compared to single-step models of similar spectral capacity (Ieff). The right plot illustrates that while single-step models' performance drops sharply as task complexity increases, CLVR maintains a resilient pass rate across all tiers without expanding the backbone capacity.
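AUC Pass integrates the per-tier pass rate over the complexity axis, so a model that degrades sharply at high tiers loses area even if its easy-tier pass rate is high. The sketch below uses a plain trapezoidal rule over illustrative numbers; the paper's exact normalization and tier values are not reproduced here.

```python
def auc_pass(tiers, pass_rates):
    """Illustrative area under the pass-complexity curve, computed with
    the trapezoidal rule over complexity tiers (a sketch, not the
    paper's exact normalization)."""
    total = 0.0
    for i in range(1, len(tiers)):
        total += 0.5 * (pass_rates[i] + pass_rates[i - 1]) * (tiers[i] - tiers[i - 1])
    return total

tiers = list(range(1, 11))                            # ten complexity tiers
single_step = [0.9 - 0.8 * i / 9 for i in range(10)]  # sharp degradation
closed_loop = [0.8] * 10                              # resilient across tiers
print(auc_pass(tiers, single_step) < auc_pass(tiers, closed_loop))  # True
```

This is why a resilient-but-modest curve can beat a strong-but-brittle one on AUC Pass at matched backbone capacity.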
Results
Headline metrics across GenEval, GenEval++, WiseBench, PRISM, and ImagineBench, plus probe and ablations aligned with the manuscript tables.
Overall scores merged from the manuscript’s benchmark tables. An em dash (—) means the benchmark was not reported for that model in the paper; CLVR deltas are vs. the matching FLUX.2 Klein base.
| Model | GenEval | GenEval++ | WiseBench | PRISM | ImagineBench |
|---|---|---|---|---|---|
| GPT-4o (reference) | 0.84 | 0.739 | 0.80 | 86.3 | 8.560 |
| FLUX.1-dev | 0.82 | 0.314 | 0.50 | 73.9 | 6.060 |
| SD3.5-Large | 0.71 | — | 0.46 | 73.9 | — |
| Qwen-Image | 0.87 | — | 0.62 | 79.9 | — |
| Uni-CoT | 0.83 | 0.635 | 0.65 | 66.1 | 7.747 |
| BAGEL-7B | 0.77 | 0.371 | 0.52 | 65.1 | 6.200 |
| T2I-R1 | — | 0.311 | 0.54 | — | 6.780 |
| Janus-Pro | 0.80 | 0.246 | — | 60.7 | 6.220 |
| FLUX.2 4B [Klein] base | 0.74 | 0.375 | 0.44 | 65.7 | 6.267 |
| CLVR (4B) | 0.87 (+0.13) | 0.616 (+0.241) | 0.74 (+0.30) | 76.3 (+10.6) | 8.435 (+2.168) |
| FLUX.2 9B [Klein] base | 0.80 | 0.307 | 0.52 | 72.7 | 7.274 |
| CLVR (9B) | 0.88 (+0.08) | 0.689 (+0.382) | 0.76 (+0.24) | 82.1 (+9.4) | 8.830 (+1.556) |
Main-paper GenEval breakdown on the non-distilled 4B base, isolating closed-loop training, PPRL, and DSWM (same rows as Table “ablation for proposed techniques”).
| Setting | Counting | Position | Color Attr. | Overall | Notes |
|---|---|---|---|---|---|
| FLUX.2 4B base | 0.62 | 0.52 | 0.59 | 0.74 | FLUX.2 baseline. |
| + CLVR without PPRL | 0.76 | 0.71 | 0.57 | 0.78 | Closed-loop decomposition already improves hard spatial/counting cases. |
| + CLVR with PPRL | 0.75 | 0.74 | 0.72 | 0.83 | Proxy prompts stabilize long-context alignment. |
| + CLVR + PPRL + DSWM | 0.85 | 0.85 | 0.71 | 0.87 | Reasoning alignment and distillation priors merge without destructive interference. |
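The PPRL step above can be pictured as reward shaping: instead of scoring a step against its full interleaved history, a teacher VLM distills that history into a compact proxy prompt (plus an optional reference image), and the step reward is computed against that explicit target. The sketch below is a hedged schematic; `pprl_targets`, `teacher_summarize`, and `teacher_score` are our illustrative names, not the paper's API.

```python
def pprl_targets(history, teacher_summarize, teacher_score):
    """Hypothetical sketch of Proxy Prompt RL reward shaping: each step's
    messy multimodal context is distilled into an explicit (proxy prompt,
    reference image) target, and the step's output is scored against that
    compact target instead of the raw history."""
    targets = []
    for step in history:
        proxy_prompt = teacher_summarize(step["context"])  # explicit text target
        reference = step.get("reference_image")            # optional image target
        reward = teacher_score(step["output"], proxy_prompt, reference)
        targets.append((proxy_prompt, reward))
    return targets

# Toy run: contexts are lists of strings, scoring is keyword presence.
history = [
    {"context": ["add a red apple", "verified: apple present"], "output": "img_1"},
    {"context": ["fix count: two apples"], "output": "img_2"},
]
summ = lambda ctx: " ".join(ctx)
score = lambda out, proxy, ref: 1.0 if "apple" in proxy else 0.0
print(pprl_targets(history, summ, score))
```

Giving each diffusion update a fixed, explicit target is what makes the optimization signal more stable than rewarding against a long, shifting context.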
Condensed from the appendix ablation table: each row adds one training stage on top of the FLUX.2 4B distill baseline. It isolates (i) open-loop rewrite, (ii) VLM-only CLVR SFT, (iii) diffusion trajectory SFT, (iv) RL with vs. without proxy prompts.
| Setting | GenEval Overall | WiseBench Overall |
|---|---|---|
| FLUX.2 4B Distill | 0.81 | 0.48 |
| Distill + rewrite (Qwen3-VL, open-loop) | 0.78 | 0.64 |
| + CLVR (VLM SFT only) | 0.86 | 0.65 |
| + CLVR + diffusion SFT | 0.85 | 0.62 |
| + CLVR + diffusion SFT + simple RL | 0.84 | 0.64 |
| + CLVR + diffusion SFT + PPRL | 0.87 | 0.74 |
Numeric counterpart to the probe figure: at matched DiT size, CLVR increases AUC Pass while keeping the spectral capacity proxy Ieff unchanged, i.e., better tier-wise robustness without a wider backbone.
| Model | DiT Params | I_eff | Pass | Median C_task | AUC Pass |
|---|---|---|---|---|---|
| SD3.5-Med | 2.5B | 604.11 | 0.244 | 28.46 | 42.53 |
| SD3.5-Lrg | 8.0B | 977.86 | 0.288 | 34.65 | 53.79 |
| CogView4 | 6.0B | 1411.05 | 0.352 | 50.54 | 70.83 |
| FLUX.2 base | 4.0B | 1586.36 | 0.372 | 50.32 | 73.89 |
| CLVR (FLUX.2) | 4.0B | 1586.36 | 0.451 | 53.76 | 98.79 (+24.90) |
Average end-to-end wall-clock time in seconds at matched iteration counts; Speedup is Base ÷ DSWM, computed the same way for every row.
| Iterations | Base 28-step | DSWM 4-step | Speedup | Why it matters |
|---|---|---|---|---|
| 1 | 192.4s | 12.6s | 15.3× | Fast enough for simple prompts. |
| 2 | 287.0s | 25.5s | 11.3× | Most common GenEval trajectory length. |
| 3 | 471.7s | 38.2s | 12.3× | Common for harder PRISM prompts. |
| 4 | 707.7s | 52.3s | 13.5× | Keeps long reasoning loops practical. |
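The Speedup column is a direct ratio of the two timing columns; the snippet below reproduces it from the table's values.

```python
# Reproduce the Speedup column: Base ÷ DSWM, rounded to one decimal.
rows = [(1, 192.4, 12.6), (2, 287.0, 25.5), (3, 471.7, 38.2), (4, 707.7, 52.3)]
for iters, base_s, dswm_s in rows:
    print(f"{iters} iteration(s): {base_s / dswm_s:.1f}x")
# → 15.3x, 11.3x, 12.3x, 13.5x
```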
Bottom Line
CLVR makes the closed visual reasoning loop practical: verified data keeps plans grounded, PPRL gives long-context rewards a stable target, and DSWM preserves deployment speed without expensive re-distillation of reasoning trajectories.
It helps most on prompts with many entities, relations, attributes, rendered text, or world-knowledge constraints.
CLVR consistently improves over its FLUX.2 base and narrows the gap to proprietary models.
Closed-loop reasoning can be accelerated to a deployable regime via 4-step DSWM decoding.