Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

CLVR turns complex text-to-image generation into an iterative visual reasoning loop: plan, generate, verify, correct, and stop when the image satisfies the prompt.

0.88 GenEval overall pass score with CLVR 9B.
82.1 PRISM overall score, above the strongest open-source baseline in the paper.
4 NFEs per diffusion step after Δ-Space Weight Merge.
11.3× speedup for the common two-iteration trajectory: 287.0 s to 25.5 s.

Key Takeaways

Read the paper in one minute.

1. Single-step generation hits a complexity ceiling.

As prompts become dense with entities, relations, and constraints, simply scaling the backbone gives diminishing returns.

See the semantic complexity scaling probe
2. Closed-loop reasoning breaks hard prompts into recoverable steps.

CLVR decomposes complex prompts into manageable sub-tasks, avoiding the pitfalls of one-shot generation and raising the capacity ceiling, while iterative visual self-correction keeps each step within reliable boundaries.

3. Proxy prompts make long-context RL trainable.

PPRL distills messy multimodal histories into explicit step-level reward targets for more stable diffusion alignment.

4. Weight merging makes the system deployable.

DSWM reuses distillation priors instead of re-distilling costly reasoning trajectories, reducing each step to 4 NFEs.

Method

The system is organized around four compact ideas.

Each component addresses a specific bottleneck in reasoning-based image generation.

Data

Verified trajectory synthesis

A state-constrained controller generates visual reasoning trajectories, with step-level validation and global A/B judging against single-step baselines.

See data pipeline figure

Training

Proxy Prompt RL

A teacher VLM converts long interleaved histories into explicit text and reference-image rewards, enabling step-wise diffusion optimization.

See framework overview

Inference

Trajectory-accumulative conditioning

The diffusion model receives the evolving reasoning trace, preserving global constraints across multiple rounds of generation and editing.

See step-by-step inference figure

Deployment

Δ-Space Weight Merge

Alignment and distillation increments are merged in weight space, keeping reasoning ability while inheriting fast 4-step decoding.

See DSWM efficiency table
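The four components above cooperate in a single loop. As a minimal sketch (all function names are hypothetical stand-ins: `plan`, `generate`, and `verify` replace the controller VLM, the trajectory-conditioned diffusion model, and the verifier with toy string logic so the loop runs end to end):

```python
def plan(prompt, trajectory):
    # Decompose the prompt into the next actionable sub-task;
    # return None once every sub-task is already on the canvas.
    done = [step["subtask"] for step in trajectory]
    remaining = [t.strip() for t in prompt.split(";") if t.strip() not in done]
    return remaining[0] if remaining else None

def generate(subtask, trajectory):
    # The diffusion model is conditioned on the *accumulated* trajectory
    # (trajectory-accumulative conditioning), not just the latest edit.
    canvas = [s["subtask"] for s in trajectory] + [subtask]
    return {"subtask": subtask, "canvas": canvas}

def verify(step):
    # Step-level check: does the canvas now satisfy this sub-task?
    return step["subtask"] in step["canvas"]

def clvr_loop(prompt, max_iters=8):
    trajectory = []
    for _ in range(max_iters):
        subtask = plan(prompt, trajectory)
        if subtask is None:       # all constraints satisfied: stop
            break
        step = generate(subtask, trajectory)
        if verify(step):          # keep only verified steps
            trajectory.append(step)
        # otherwise: re-plan and correct on the next iteration
    return trajectory
```

The stopping rule is the point: the loop terminates when the planner finds no remaining gap between the canvas and the prompt, not after a fixed step count.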
CLVR framework overview diagram

CLVR framework overview

The CLVR framework consists of three main components: (1) a training pipeline utilizing Supervised Fine-Tuning (SFT) and Proxy Prompt Reinforcement Learning (PPRL) to align the model with multi-step reasoning; (2) a closed-loop inference framework where the controller iteratively revisits the canvas; and (3) a Δ-Space Weight Merge algorithm that fuses alignment gains with distillation priors for efficient 4-step decoding.

CLVR verified data synthesis pipeline

Verified data production

The data synthesis pipeline follows a Perceive-Reason-Act workflow orchestrated by a state-constrained agentic controller (Gemini 2.5 Pro). It generates high-fidelity reasoning trajectories through a dual-track verification mechanism: passive verification acts as a step-level gatekeeper to discard execution errors, while active verification serves as a global error-correction hub to detect and resolve semantic gaps between the canvas and the user prompt.

Example verified training trajectory with self-correction

Trajectory data showcase

A visualization of a verified reasoning trajectory synthesized by the data engine. In this example, the system first generates a base image, identifies a missing brand logo and adds it, and finally recognizes missing rear details to change the perspective to a rear three-quarter view. This demonstrates how the controller iteratively refines the image based on visual feedback until the final objective is met.

Step-by-step CLVR inference trajectory from sketch to final image

Step-by-step inference trajectory

A CLVR inference case executing a complex prompt. The trajectory begins with concept initialization, followed by environment embedding, lighting and atmosphere refinement, and final typographic integration, resulting in a cohesive image that fulfills all complex requirements through accumulated reasoning steps.

Qualitative comparison of CLVR against baselines on challenging prompts

Visual comparison with baselines

Qualitative comparisons between CLVR and strong baselines on challenging prompts. The results illustrate how single-step models often fail to satisfy all constraints (e.g., missed text, wrong object counts, or attribute drift), whereas CLVR's closed loop catches and corrects these failure modes, approaching the instruction following of state-of-the-art proprietary models.

Semantic complexity scaling probe: pass rate versus task complexity and AUC summary

Semantic complexity scaling probe

Quantitative results of the semantic complexity probe diagnosing structural degradation in single-step generation. The probe stratifies prompts into ten complexity tiers C_task. The left plot shows that CLVR achieves a higher area under the pass-complexity curve (AUC_pass) compared to single-step models of similar spectral capacity (I_eff). The right plot illustrates that while single-step models' performance drops sharply as task complexity increases, CLVR maintains a resilient pass rate across all tiers without expanding the backbone capacity.


Results

CLVR improves quality, robustness, and latency together.

Headline metrics across GenEval, GenEval++, WiseBench, PRISM, and ImagineBench, plus probe and ablations aligned with the manuscript tables.

Main benchmark comparison

Overall scores merged from the manuscript's benchmark tables. An em dash (—) marks a benchmark not reported for that model in the paper; CLVR deltas are vs. the matching FLUX.2 Klein base.

| Model | GenEval | GenEval++ | WiseBench | PRISM | ImagineBench |
| --- | --- | --- | --- | --- | --- |
| GPT-4o (reference) | 0.84 | 0.739 | 0.80 | 86.3 | 8.560 |
| FLUX.1-dev | 0.82 | 0.314 | 0.50 | 73.9 | 6.060 |
| SD3.5-Large | 0.71 | — | 0.46 | 73.9 | — |
| Qwen-Image | 0.87 | — | 0.62 | 79.9 | — |
| Uni-CoT | 0.83 | 0.635 | 0.65 | 66.1 | 7.747 |
| BAGEL-7B | 0.77 | 0.371 | 0.52 | 65.1 | 6.200 |
| T2I-R1 | — | 0.311 | 0.54 | — | 6.780 |
| Janus-Pro | 0.80 | 0.246 | — | 60.7 | 6.220 |
| FLUX.2 4B [Klein] base | 0.74 | 0.375 | 0.44 | 65.7 | 6.267 |
| CLVR (4B) | 0.87 (+0.13) | 0.616 (+0.241) | 0.74 (+0.30) | 76.3 (+10.6) | 8.435 (+2.168) |
| FLUX.2 9B [Klein] base | 0.80 | 0.307 | 0.52 | 72.7 | 7.274 |
| CLVR (9B) | 0.88 (+0.08) | 0.689 (+0.382) | 0.76 (+0.24) | 82.1 (+9.4) | 8.830 (+1.556) |

Ablation: what contributes to the gain?

Main-paper GenEval breakdown on the non-distilled 4B base, isolating closed-loop training, PPRL, and DSWM (same rows as Table “ablation for proposed techniques”).

| Setting | Counting | Position | Color Attr. | Overall | Notes |
| --- | --- | --- | --- | --- | --- |
| FLUX.2 4B base | 0.62 | 0.52 | 0.59 | 0.74 | FLUX.2 baseline. |
| + CLVR without PPRL | 0.76 | 0.71 | 0.57 | 0.78 | Closed-loop decomposition already improves hard spatial/counting cases. |
| + CLVR with PPRL | 0.75 | 0.74 | 0.72 | 0.83 | Proxy prompts stabilize long-context alignment. |
| + CLVR + PPRL + DSWM | 0.85 | 0.85 | 0.71 | 0.87 | Reasoning alignment and distillation priors merge without destructive interference. |

Ablation: CLVR training recipe (4B distill)

Condensed from the appendix ablation table: each row adds one training stage on top of the FLUX.2 4B distill baseline. It isolates (i) open-loop rewrite, (ii) VLM-only CLVR SFT, (iii) diffusion trajectory SFT, (iv) RL with vs. without proxy prompts.

| Setting | GenEval Overall | WiseBench Overall |
| --- | --- | --- |
| FLUX.2 4B Distill | 0.81 | 0.48 |
| Distill + rewrite (Qwen3-VL, open-loop) | 0.78 | 0.64 |
| + CLVR (VLM SFT only) | 0.86 | 0.65 |
| + CLVR + diffusion SFT | 0.85 | 0.62 |
| + CLVR + diffusion SFT + simple RL | 0.84 | 0.64 |
| + CLVR + diffusion SFT + PPRL | 0.87 | 0.74 |
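The gap between the last two rows is the proxy-prompt effect. The idea can be illustrated with a toy sketch (in the paper a teacher VLM produces the proxy prompt and judges outputs; the deduplication and overlap-based reward below are only stand-ins):

```python
def make_proxy_prompt(history):
    """Collapse a long interleaved history into one explicit target prompt."""
    # Keep each instruction once, in order, dropping duplicates.
    seen, parts = set(), []
    for turn in history:
        if turn not in seen:
            seen.add(turn)
            parts.append(turn)
    return ", ".join(parts)

def proxy_reward(output_caption, proxy_prompt):
    """Step-level reward: fraction of proxy-prompt terms the output covers."""
    target = set(proxy_prompt.replace(",", " ").split())
    found = set(output_caption.split())
    return len(target & found) / max(len(target), 1)
```

The point of the construction: the diffusion policy is rewarded against a short, explicit target rather than the raw multimodal history, which is what makes the optimization signal stable.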

Semantic complexity scaling probe

Numeric counterpart to the probe figure: at matched DiT size, CLVR increases AUC_pass while keeping the spectral capacity proxy I_eff unchanged, i.e., better tier-wise robustness without a wider backbone.

| Model | DiT Params | I_eff | Pass | Median C_task | AUC_pass |
| --- | --- | --- | --- | --- | --- |
| SD3.5-Med | 2.5B | 604.11 | 0.244 | 28.46 | 42.53 |
| SD3.5-Lrg | 8.0B | 977.86 | 0.288 | 34.65 | 53.79 |
| CogView4 | 6.0B | 1411.05 | 0.352 | 50.54 | 70.83 |
| FLUX.2 base | 4.0B | 1586.36 | 0.372 | 50.32 | 73.89 |
| CLVR (FLUX.2) | 4.0B | 1586.36 | 0.451 | 53.76 | 98.79 (+24.90) |
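An AUC_pass-style summary can be computed from tier-wise pass rates with a generic trapezoidal integral (a sketch of the metric's shape; the paper's exact tiering and normalization may differ):

```python
def auc_pass(tiers, pass_rates):
    """Area under the pass-rate curve over complexity tiers C_task."""
    area = 0.0
    for (c0, p0), (c1, p1) in zip(zip(tiers, pass_rates),
                                  zip(tiers[1:], pass_rates[1:])):
        # Trapezoid between adjacent tiers: mean height x tier width.
        area += 0.5 * (p0 + p1) * (c1 - c0)
    return area
```

A model that holds its pass rate at high complexity tiers accumulates far more area than one whose curve collapses early, even if both start from the same easy-tier pass rate.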

Inference efficiency with DSWM

Average end-to-end wall-clock time in seconds at the same iteration count; speedup is Base ÷ DSWM.

| Iterations | Base (28-step) | DSWM (4-step) | Speedup | Why it matters |
| --- | --- | --- | --- | --- |
| 1 | 192.4 s | 12.6 s | 15.3× | Fast enough for simple prompts. |
| 2 | 287.0 s | 25.5 s | 11.3× | Most common GenEval trajectory length. |
| 3 | 471.7 s | 38.2 s | 12.3× | Common for harder PRISM prompts. |
| 4 | 707.7 s | 52.3 s | 13.5× | Keeps long reasoning loops practical. |
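The merge behind these numbers operates purely in weight space. A minimal sketch of a delta-space merge in the spirit of DSWM (illustrative only: weights are scalars here for clarity, the coefficients and any per-tensor handling from the paper are not reproduced):

```python
def dswm_merge(base, aligned, distilled, alpha=1.0, beta=1.0):
    """Combine two increments over a shared base checkpoint.

    `base`, `aligned`, and `distilled` are {name: weight} state dicts;
    the same arithmetic applies element-wise to real tensors.
    """
    merged = {}
    for name, w in base.items():
        d_align = aligned[name] - w       # reasoning-alignment increment
        d_distill = distilled[name] - w   # few-step distillation increment
        merged[name] = w + alpha * d_align + beta * d_distill
    return merged
```

Setting alpha or beta to zero recovers the aligned-only or distilled-only model, which makes it easy to probe whether the two increments interfere destructively.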

Bottom Line

Complex visual generation benefits from reasoning only when the loop is verified, trainable, and fast.

CLVR makes that loop practical: verified data keeps plans grounded, PPRL gives long-context rewards a stable target, and DSWM preserves deployment speed without expensive reasoning-data re-distillation.

Best use case

Prompts with many entities, relations, attributes, text, or world-knowledge constraints.

Main empirical claim

CLVR consistently improves over its FLUX.2 base and narrows the gap to proprietary models.

Practical claim

Closed-loop reasoning can be accelerated to a deployable regime via 4-step DSWM decoding.