Single-step generation hits a complexity ceiling.
As prompts become dense with entities, relations, and constraints, simply scaling the backbone gives diminishing returns.
CLVR turns complex text-to-image generation into an iterative visual reasoning loop: plan, generate, verify, correct, and stop once the image satisfies the prompt.
PRISM qualitative results, stitched left to right: panel A · panel B · panel C.
Key Takeaways
Scaling the backbone alone yields diminishing returns once prompts become dense with entities, relations, and constraints.
CLVR enables systematic task decomposition, breaking complex prompts into manageable sub-tasks to avoid the pitfalls of one-shot generation and raise the capacity ceiling, while iterative visual self-correction keeps each step within reliable boundaries.
PPRL distills messy multimodal histories into explicit step-level reward targets for more stable diffusion alignment.
DSWM reuses distillation priors instead of re-distilling costly reasoning trajectories, reducing each step to 4 NFEs.
Method
Each component addresses a specific bottleneck in reasoning-based image generation.
Data
A state-constrained controller generates visual reasoning trajectories, with step-level validation and global A/B judging against single-step baselines.
See data pipeline figure
Training
A teacher VLM converts long interleaved histories into explicit text and reference-image rewards, enabling step-wise diffusion optimization.
See framework overview
Inference
The diffusion model receives the evolving reasoning trace, preserving global constraints across multiple rounds of generation and editing.
See step-by-step inference figure
Deployment
Alignment and distillation increments are merged in weight space, keeping reasoning ability while inheriting fast 4-step decoding.
See DSWM efficiency table
The CLVR framework consists of three main components: (1) a training pipeline utilizing Supervised Fine-Tuning (SFT) and Proxy Prompt Reinforcement Learning (PPRL) to align the model with multi-step reasoning; (2) a closed-loop inference framework where the controller iteratively revisits the canvas; and (3) a Δ-Space Weight Merge algorithm that fuses alignment gains with distillation priors for efficient 4-step decoding.
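The Δ-Space Weight Merge can be pictured as adding two task-vector increments to a shared base checkpoint: the alignment delta and the distillation delta are computed relative to the base and summed in weight space. The sketch below is ours, not the paper's implementation; the function name `merge_deltas` and the scalar coefficients `alpha`/`beta` are illustrative assumptions.

```python
import numpy as np

def merge_deltas(base, aligned, distilled, alpha=1.0, beta=1.0):
    """Hypothetical Δ-space merge: combine the alignment increment
    (aligned - base) and the distillation increment (distilled - base)
    in weight space, on top of the shared base checkpoint."""
    merged = {}
    for name, w_base in base.items():
        d_align = aligned[name] - w_base      # reasoning-alignment delta
        d_distill = distilled[name] - w_base  # few-step distillation delta
        merged[name] = w_base + alpha * d_align + beta * d_distill
    return merged

# Toy two-parameter example: each fine-tune moved a different weight.
base = {"w": np.array([1.0, 2.0])}
aligned = {"w": np.array([1.5, 2.0])}    # alignment moved the first weight
distilled = {"w": np.array([1.0, 2.5])}  # distillation moved the second
out = merge_deltas(base, aligned, distilled)
print(out["w"])  # [1.5 2.5]
```

Because the two deltas touch largely different directions in weight space, their sum can retain both behaviors, which is the intuition behind merging without destructive interference.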
The data synthesis pipeline follows a Perceive-Reason-Act workflow orchestrated by a state-constrained agentic controller (Gemini 2.5 Pro). It generates high-fidelity reasoning trajectories through a dual-track verification mechanism: passive verification acts as a step-level gatekeeper to discard execution errors, while active verification serves as a global error-correction hub to detect and resolve semantic gaps between the canvas and the user prompt.
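The dual-track verification loop can be sketched as follows. This is a minimal schematic, not the paper's code: `generate`, `edit`, `passive_check`, and `active_check` are hypothetical callables standing in for the diffusion model, the editor, the step-level gatekeeper, and the global judge respectively.

```python
def run_trajectory(prompt, generate, edit, passive_check, active_check, max_steps=4):
    """Hypothetical sketch of the Perceive-Reason-Act loop with dual-track
    verification: active_check compares the whole canvas against the prompt
    and reports a semantic gap; passive_check gates each edit's execution."""
    canvas = generate(prompt)
    trajectory = [("generate", prompt)]
    for _ in range(max_steps):
        gap = active_check(canvas, prompt)        # global semantic gap, or None
        if gap is None:
            return canvas, trajectory             # prompt satisfied: stop early
        candidate = edit(canvas, gap)             # act on the detected gap
        if passive_check(candidate, gap):         # step-level gatekeeper
            canvas = candidate
            trajectory.append(("edit", gap))
        # else: discard the failed execution and retry next round
    return canvas, trajectory

# Toy instantiation: "images" are sets of satisfied constraints.
target = {"cat", "logo"}
img, traj = run_trajectory(
    "a cat with a logo",
    generate=lambda p: {"cat"},
    edit=lambda image, gap: image | {gap},
    passive_check=lambda image, gap: gap in image,
    active_check=lambda image, p: next(iter(target - image), None),
)
print(sorted(img))  # ['cat', 'logo']
```

Only trajectories whose every step passes the gatekeeper and whose final canvas closes all detected gaps would be kept as training data.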
A visualization of a verified reasoning trajectory synthesized by the data engine. In this example, the system first generates a base image, identifies a missing brand logo and adds it, and finally recognizes missing rear details to change the perspective to a rear three-quarter view. This demonstrates how the controller iteratively refines the image based on visual feedback until the final objective is met.
A CLVR inference case executing a complex prompt. The trajectory begins with concept initialization, followed by environment embedding, lighting and atmosphere refinement, and final typographic integration, resulting in a cohesive image that fulfills all complex requirements through accumulated reasoning steps.
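The key property of this inference loop is that each round conditions on the full evolving reasoning trace, not just the latest instruction, so earlier constraints stay visible to later edits. A minimal sketch, with `apply_step` as a hypothetical stand-in for one diffusion generation/edit round:

```python
def closed_loop_infer(steps, apply_step):
    """Hypothetical sketch of trace-conditioned inference: the evolving
    reasoning trace accumulates every instruction, and each round of
    generation or editing sees the whole trace."""
    trace, image = [], None
    for instruction in steps:
        trace.append(instruction)
        # The model is conditioned on the full trace, preserving
        # global constraints across multiple rounds.
        image = apply_step(image, list(trace))
    return image, trace

steps = [
    "initialize concept: vintage car",
    "embed environment: rain-slicked street",
    "refine lighting: warm dusk glow",
    "integrate typography: neon sign text",
]
# Toy apply_step: record how many constraints the final round saw.
image, trace = closed_loop_infer(steps, lambda img, tr: {"conditioned_on": len(tr)})
print(image)  # {'conditioned_on': 4}
```

Dropping the accumulated trace and conditioning only on the last instruction is exactly what lets attribute drift creep in over multiple edits.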
Qualitative comparisons between CLVR and strong baselines on challenging prompts. The results illustrate how single-step models often fail to satisfy all constraints (e.g., missed text, wrong object counts, or attribute drift), whereas CLVR's closed-loop approach approaches state-of-the-art proprietary instruction following by catching and correcting these failure modes.
Quantitative results of the semantic complexity probe diagnosing structural degradation in single-step generation. The probe stratifies prompts into ten complexity tiers Ctask. The left plot shows that CLVR achieves a higher Area Under the Pass-Complexity Curve (AUC Pass) compared to single-step models of similar spectral capacity (Ieff). The right plot illustrates that while single-step models' performance drops sharply as task complexity increases, CLVR maintains a resilient pass rate across all tiers without expanding the backbone capacity.
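AUC Pass integrates the per-tier pass rate over the complexity axis, so a model that degrades sharply at high tiers loses area even if its easy-tier pass rate is high. The sketch below uses a plain trapezoidal rule over illustrative numbers; the paper's exact normalization and tier values are not reproduced here.

```python
def auc_pass(tiers, pass_rates):
    """Illustrative area under the pass-complexity curve, computed with
    the trapezoidal rule over complexity tiers (a sketch, not the
    paper's exact normalization)."""
    total = 0.0
    for i in range(1, len(tiers)):
        total += 0.5 * (pass_rates[i] + pass_rates[i - 1]) * (tiers[i] - tiers[i - 1])
    return total

tiers = list(range(1, 11))                            # ten complexity tiers
single_step = [0.9 - 0.8 * i / 9 for i in range(10)]  # sharp degradation
closed_loop = [0.8] * 10                              # resilient across tiers
print(auc_pass(tiers, single_step) < auc_pass(tiers, closed_loop))  # True
```

This is why a resilient-but-modest curve can beat a strong-but-brittle one on AUC Pass at matched backbone capacity.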
Results
Headline metrics across GenEval, GenEval++, WiseBench, PRISM, and ImagineBench, plus probe and ablations aligned with the manuscript tables.
Overall scores merged from the manuscript’s benchmark tables. An em dash (—) means the benchmark was not reported for that model in the paper; CLVR deltas are vs. the matching FLUX.2 Klein base.
| Model | GenEval | GenEval++ | WiseBench | PRISM | ImagineBench |
|---|---|---|---|---|---|
| GPT-4o (reference) | 0.84 | 0.739 | 0.80 | 86.3 | 8.560 |
| FLUX.1-dev | 0.82 | 0.314 | 0.50 | 73.9 | 6.060 |
| SD3.5-Large | 0.71 | — | 0.46 | 73.9 | — |
| Qwen-Image | 0.87 | — | 0.62 | 79.9 | — |
| Uni-CoT | 0.83 | 0.635 | 0.65 | 66.1 | 7.747 |
| BAGEL-7B | 0.77 | 0.371 | 0.52 | 65.1 | 6.200 |
| T2I-R1 | — | 0.311 | 0.54 | — | 6.780 |
| Janus-Pro | 0.80 | 0.246 | — | 60.7 | 6.220 |
| FLUX.2 4B [Klein] base | 0.74 | 0.375 | 0.44 | 65.7 | 6.267 |
| CLVR (4B) | 0.87 (+0.13) | 0.616 (+0.241) | 0.74 (+0.30) | 76.3 (+10.6) | 8.435 (+2.168) |
| FLUX.2 9B [Klein] base | 0.80 | 0.307 | 0.52 | 72.7 | 7.274 |
| CLVR (9B) | 0.88 (+0.08) | 0.689 (+0.382) | 0.76 (+0.24) | 82.1 (+9.4) | 8.830 (+1.556) |
Main-paper GenEval breakdown on the non-distilled 4B base, isolating closed-loop training, PPRL, and DSWM (same rows as Table “ablation for proposed techniques”).
| Setting | Counting | Position | Color Attr. | Overall | Notes |
|---|---|---|---|---|---|
| FLUX.2 4B base | 0.62 | 0.52 | 0.59 | 0.74 | FLUX.2 baseline. |
| + CLVR without PPRL | 0.76 | 0.71 | 0.57 | 0.78 | Closed-loop decomposition already improves hard spatial/counting cases. |
| + CLVR with PPRL | 0.75 | 0.74 | 0.72 | 0.83 | Proxy prompts stabilize long-context alignment. |
| + CLVR + PPRL + DSWM | 0.85 | 0.85 | 0.71 | 0.87 | Reasoning alignment and distillation priors merge without destructive interference. |
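The PPRL step above can be pictured as reward shaping: instead of scoring a step against its full interleaved history, a teacher VLM distills that history into a compact proxy prompt (plus an optional reference image), and the step reward is computed against that explicit target. The sketch below is a hedged schematic; `pprl_targets`, `teacher_summarize`, and `teacher_score` are our illustrative names, not the paper's API.

```python
def pprl_targets(history, teacher_summarize, teacher_score):
    """Hypothetical sketch of Proxy Prompt RL reward shaping: each step's
    messy multimodal context is distilled into an explicit (proxy prompt,
    reference image) target, and the step's output is scored against that
    compact target instead of the raw history."""
    targets = []
    for step in history:
        proxy_prompt = teacher_summarize(step["context"])  # explicit text target
        reference = step.get("reference_image")            # optional image target
        reward = teacher_score(step["output"], proxy_prompt, reference)
        targets.append((proxy_prompt, reward))
    return targets

# Toy run: contexts are lists of strings, scoring is keyword presence.
history = [
    {"context": ["add a red apple", "verified: apple present"], "output": "img_1"},
    {"context": ["fix count: two apples"], "output": "img_2"},
]
summ = lambda ctx: " ".join(ctx)
score = lambda out, proxy, ref: 1.0 if "apple" in proxy else 0.0
print(pprl_targets(history, summ, score))
```

Giving each diffusion update a fixed, explicit target is what makes the optimization signal more stable than rewarding against a long, shifting context.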
Condensed from the appendix ablation table: each row adds one training stage on top of the FLUX.2 4B distill baseline. It isolates (i) open-loop rewrite, (ii) VLM-only CLVR SFT, (iii) diffusion trajectory SFT, (iv) RL with vs. without proxy prompts.
| Setting | GenEval Overall | WiseBench Overall |
|---|---|---|
| FLUX.2 4B Distill | 0.81 | 0.48 |
| Distill + rewrite (Qwen3-VL, open-loop) | 0.78 | 0.64 |
| + CLVR (VLM SFT only) | 0.86 | 0.65 |
| + CLVR + diffusion SFT | 0.85 | 0.62 |
| + CLVR + diffusion SFT + simple RL | 0.84 | 0.64 |
| + CLVR + diffusion SFT + PPRL | 0.87 | 0.74 |
Numeric counterpart to the probe figure: at matched DiT size, CLVR increases AUC Pass while keeping the spectral capacity proxy Ieff unchanged, i.e., better tier-wise robustness without a wider backbone.
| Model | DiT Params | I_eff | Pass | Median C_task | AUC Pass |
|---|---|---|---|---|---|
| SD3.5-Med | 2.5B | 604.11 | 0.244 | 28.46 | 42.53 |
| SD3.5-Lrg | 8.0B | 977.86 | 0.288 | 34.65 | 53.79 |
| CogView4 | 6.0B | 1411.05 | 0.352 | 50.54 | 70.83 |
| FLUX.2 base | 4.0B | 1586.36 | 0.372 | 50.32 | 73.89 |
| CLVR (FLUX.2) | 4.0B | 1586.36 | 0.451 | 53.76 | 98.79 (+24.90) |
Average end-to-end wall-clock time in seconds at matched iteration counts; Speedup is Base ÷ DSWM, computed the same way for every row.
| Iterations | Base 28-step | DSWM 4-step | Speedup | Why it matters |
|---|---|---|---|---|
| 1 | 192.4s | 12.6s | 15.3× | Fast enough for simple prompts. |
| 2 | 287.0s | 25.5s | 11.3× | Most common GenEval trajectory length. |
| 3 | 471.7s | 38.2s | 12.3× | Common for harder PRISM prompts. |
| 4 | 707.7s | 52.3s | 13.5× | Keeps long reasoning loops practical. |
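The Speedup column is a direct ratio of the two timing columns; the snippet below reproduces it from the table's values.

```python
# Reproduce the Speedup column: Base ÷ DSWM, rounded to one decimal.
rows = [(1, 192.4, 12.6), (2, 287.0, 25.5), (3, 471.7, 38.2), (4, 707.7, 52.3)]
for iters, base_s, dswm_s in rows:
    print(f"{iters} iteration(s): {base_s / dswm_s:.1f}x")
# → 15.3x, 11.3x, 12.3x, 13.5x
```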
Bottom Line
CLVR makes the closed visual reasoning loop practical: verified data keeps plans grounded, PPRL gives long-context rewards a stable target, and DSWM preserves deployment speed without expensive re-distillation of reasoning trajectories.
It helps most on prompts with many entities, relations, attributes, rendered text, or world-knowledge constraints.
CLVR consistently improves over its FLUX.2 base and narrows the gap to proprietary models.
Closed-loop reasoning can be accelerated to a deployable regime via 4-step DSWM decoding.