A training-free, inference-time approach pairing Mask Prior Suppression with Monotonic RoPE Scaling to fix repetitive generation and degraded visual grounding in large diffusion vision–language models.
Large diffusion vision–language models (LDVLMs) enable parallel decoding and bidirectional attention, yet their behavior under long-form generation remains underexplored. We show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, driven by two causes: a mask token prior that pulls hidden representations toward a shared direction over generation steps, and a misalignment between the positional attention bias and the iterative unmasking process that suppresses attention to informative visual tokens.
We propose a training-free approach, Mask Prior Suppression and Monotonic RoPE Scaling, that mitigates mask prior drift and positional attention collapse during decoding. Experiments on multimodal benchmarks and visual grounding tasks show consistent gains over baselines, with robust improvements on long-form description — all from a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.
Before fixing anything, we measure two concrete failure modes inside LLaDA-V and LaViDa: mask prior drift and positional attention collapse.
Figure 2. (a) Fewer generation steps → lower distinct-n and higher repetition. (b) 3D PCA trajectories: vocab-mean and uncontextualized mask token converge in the final layer. (c) Contextualized mask tokens align with the vocab mean far more strongly than random embeddings.
Across generation steps, the hidden states of contextualized mask tokens are pulled toward the mean vocabulary direction in the final layer.
The fewer the unmasking steps, the stronger the collapse — and the more the model emits structured repetitions instead of diverse text.
This motivates Mask Prior Suppression: remove the prior direction from the final hidden state before the LM head.
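As a rough illustration of the repetition measurement in Figure 2(a), distinct-n is the fraction of n-grams in a text that are unique. The sketch below assumes whitespace tokenization, which this page does not specify; `distinct_n` is a hypothetical helper name, not part of any released code.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams (1.0 means no n-gram repeats)."""
    tokens = text.split()  # assumption: whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens
    return len(set(ngrams)) / len(ngrams)


# Toy inputs modeled on the qualitative examples on this page.
repetitive = "the, the, the, the, the, the, the, the"
varied = "warm lighting from the lamps provides a cozy glow"
print(distinct_n(repetitive, 2), distinct_n(varied, 2))
```

On these toy strings the varied sentence scores 1.0 and the repetitive one about 0.29, which is the direction of the effect the figure reports for fewer unmasking steps.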
Generation tokens disproportionately attend to nearby mask tokens with little semantic content, while attention to distant visual tokens decays sharply with distance.
A frequency decomposition reveals the cause: high-frequency RoPE dimensions dominate at short range, low-frequency ones carry long-range interactions — but the overall attention to far visual tokens stays weak.
This motivates Monotonic RoPE Scaling: lift low-frequency components more strongly than high-frequency ones to restore long-range visual attention.
Figure 3. (a) Attention vs. relative distance — mask tokens are over-attended, visual tokens decay quickly. (b) Per-step attention summed over visual vs. mask tokens — mask tokens absorb comparable mass throughout decoding. (c) Frequency-wise decomposition: high-freq dominates short range, low-freq long range; overall long-range attention is still weak.
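The frequency-wise view in Figure 3(c) rests on the fact that a RoPE attention logit is a sum of independent per-frequency terms, each depending only on the relative distance between query and key. A minimal sketch of that decomposition follows, assuming the interleaved pair layout and a base of 10000 (the layout and base used by LLaDA-V and LaViDa are not stated on this page); all function names are hypothetical.

```python
import math


def rope_rotate(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)  # i = 0 is the highest frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def per_frequency_logits(q, k, q_pos, k_pos, base=10000.0):
    """Contribution of each frequency pair to the logit <RoPE(q), RoPE(k)>."""
    rq, rk = rope_rotate(q, q_pos, base), rope_rotate(k, k_pos, base)
    return [rq[2 * i] * rk[2 * i] + rq[2 * i + 1] * rk[2 * i + 1]
            for i in range(len(q) // 2)]


def band_split(parts):
    """Sum the terms into (high-frequency band, low-frequency band)."""
    half = len(parts) // 2
    return sum(parts[:half]), sum(parts[half:])


q = [1.0, 0.5, -0.3, 0.8, 0.2, -0.1, 0.7, 0.4]
k = [0.4, -0.2, 0.9, 0.1, -0.5, 0.3, 0.6, 0.2]
near = band_split(per_frequency_logits(q, k, q_pos=3, k_pos=2))
far = band_split(per_frequency_logits(q, k, q_pos=500, k_pos=2))
print(near, far)
```

The split only shows how a logit can be attributed to frequency bands. Which band dominates at which range is an empirical finding about the trained models and is not reproduced by these toy vectors.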
Both interventions are plug-in: no retraining, no fine-tuning, no extra parameters.
Figure 1. End-to-end view of Mask Prior Suppression and Monotonic RoPE Scaling.
Mask token hidden states drift toward a shared prior — the vocabulary-mean direction — driving repetitive generation. We forward the mean embedding through the network, run PCA on the layer-wise means to obtain a low-rank prior subspace, and at the final layer project each token's deviation onto this subspace and suppress it adaptively by cosine similarity. Only ~0.07% of the final-layer dimensions are touched, preserving semantics.
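The projection-and-suppress step can be sketched in NumPy as follows. This is one possible reading of the description above, not the authors' implementation: the rank, the choice of `center` (here the mean of the layer-wise means), and the gate (suppression strength times the cosine between a token's deviation and its projection) are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def prior_subspace(layer_means, rank=1):
    """PCA over the per-layer hidden states of the vocabulary-mean embedding.

    layer_means: array of shape (num_layers, hidden_dim).
    Returns an orthonormal basis of shape (rank, hidden_dim).
    """
    centered = layer_means - layer_means.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]  # rows are ordered by decreasing singular value


def suppress_mask_prior(hidden, basis, center, strength=1.0, eps=1e-8):
    """Shrink the part of each token's deviation that lies in the prior subspace.

    hidden: (num_tokens, hidden_dim) final-layer states, before the LM head.
    The gate is an assumption: strength * cos(deviation, projection).
    """
    dev = hidden - center
    proj = (dev @ basis.T) @ basis  # component of dev inside the subspace
    cos = np.linalg.norm(proj, axis=-1) / np.maximum(
        np.linalg.norm(dev, axis=-1), eps)
    return hidden - strength * cos[..., None] * proj


# Toy check: the layer means vary only along axis 0, so the prior is axis 0.
e0, e1 = np.eye(8)[0], np.eye(8)[1]
layer_means = np.outer(np.arange(4.0), e0)
basis = prior_subspace(layer_means, rank=1)
center = layer_means.mean(axis=0)
hidden = center + np.array([3 * e0 + 4 * e1])
print(suppress_mask_prior(hidden, basis, center) - center)
```

With this gate, a token whose deviation lies entirely in the prior subspace is pulled back to `center`, while a token orthogonal to the subspace is left untouched.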
High-frequency RoPE components concentrate attention on nearby mask tokens, while distant visual tokens — carried by low-frequency components — are under-attended. We apply a sigmoid-gated, frequency-aware scaling that boosts lower-frequency components more strongly than higher ones, lifting long-range attention to visual tokens while preserving RoPE's relative-distance structure.
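A sigmoid-gated, monotonic gain over RoPE frequency pairs might look like the sketch below. The parameters `alpha`, `midpoint`, and `temperature`, their default values, and the choice to apply the gain to the query (or key) vector are assumptions made for illustration; the page does not give the exact parameterization.

```python
import math


def monotonic_rope_scales(num_pairs, alpha=0.5, midpoint=None, temperature=4.0):
    """Per-pair gains 1 + alpha * sigmoid((i - midpoint) / temperature).

    Pair index i grows as the RoPE frequency base**(-2i/d) drops, so the
    gain rises monotonically from high-frequency to low-frequency pairs.
    """
    if midpoint is None:
        midpoint = (num_pairs - 1) / 2
    return [1.0 + alpha / (1.0 + math.exp(-(i - midpoint) / temperature))
            for i in range(num_pairs)]


def scale_pairs(x, scales):
    """Multiply both coordinates of each rotary pair (x[2i], x[2i+1]) by scales[i]."""
    return [v * scales[j // 2] for j, v in enumerate(x)]


scales = monotonic_rope_scales(num_pairs=8)
query = [1.0] * 16
print(scales)
print(scale_pairs(query, scales))
```

Because each pair is multiplied by a single scalar, the gain commutes with the RoPE rotation of that pair: every per-frequency term of the attention logit is rescaled, but it remains a function of relative distance only.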
Across nine benchmarks, our method matches or improves the baseline on both LLaDA-V and LaViDa (only GQA on LLaDA-V is unchanged, at 61.6), with the largest improvements on visual grounding and long-form generation.
Column groups: General (MME, MMBench, MMMU) · Visual Grounding (RefCOCOg, Ferret, GQA) · Long-form Generation (LLaVA-Bench, DetailCaps, MIA).

| Model | MME | MMBench | MMMU | RefCOCOg | Ferret | GQA | LLaVA-Bench | DetailCaps | MIA |
|---|---|---|---|---|---|---|---|---|---|
| LLaDA-V | 1998 | 82.9 | 48.6 | 64.8 | 60.4 | 61.6 | 61.3 | 59.8 | 66.1 |
| + Ours | 2003 | 83.3 | 49.3 | 65.0 | 62.9 | 61.6 | 64.1 | 63.6 | 67.0 |
| LaViDa | 1682 | 71.7 | 43.2 | 36.9 | 25.9 | 59.4 | 39.5 | 8.3 | 49.4 |
| + Ours | 1705 | 72.0 | 43.7 | 44.0 | 35.7 | 60.2 | 46.5 | 56.1 | 57.3 |
Highest gains on visual grounding (LaViDa: RefCOCOg +7.1, Ferret +9.8) and long-form generation (LaViDa: DetailCaps +47.8, LLaVA-Bench +7.0). All hyperparameters fixed across tasks; no retraining.
Each component helps on its own for most metrics, but the full model (MPS + MRS) gives the most stable and balanced gains across visual grounding and long-form generation. MPS: Mask Prior Suppression · MRS: Monotonic RoPE Scaling.
Column groups: Visual Grounding (RefCOCOg, Ferret) · Long-form Generation (LLaVA-Bench, DetailCaps). ✓ = component enabled, ✗ = disabled.

| Model | MPS | MRS | RefCOCOg | Ferret | LLaVA-Bench | DetailCaps |
|---|---|---|---|---|---|---|
| LLaDA-V | ✗ | ✗ | 64.8 | 60.4 | 61.3 | 59.8 |
| LLaDA-V | ✓ | ✗ | 64.8 | 60.2 | 61.7 | 60.0 |
| LLaDA-V | ✗ | ✓ | 64.8 | 60.6 | 63.9 | 60.0 |
| LLaDA-V + Ours | ✓ | ✓ | 65.0 | 62.9 | 64.1 | 63.6 |
| LaViDa | ✗ | ✗ | 36.9 | 25.9 | 39.5 | 8.3 |
| LaViDa | ✓ | ✗ | 44.0 | 35.4 | 42.3 | 56.1 |
| LaViDa | ✗ | ✓ | 39.5 | 29.8 | 41.0 | 8.3 |
| LaViDa + Ours | ✓ | ✓ | 44.0 | 35.7 | 46.5 | 56.1 |
On LaViDa, MPS alone unlocks the largest jump on DetailCaps (8.3 → 56.1), while MRS alone leaves that score at 8.3. This is consistent with mask prior drift being the main driver of long-form collapse. MRS independently lifts LLaVA-Bench on LLaDA-V (61.3 → 63.9).
Visual grounding · RefCOCOg-style
Input · red box = target
The woman in the red shirt.
A woman in a green shirt.
Visual grounding · RefCOCOg-style
Input · red box = target
The net of a tennis court.
A woman holding a tennis racket.
Long-form generation · MIA-style
Input
Baseline: This this room bedroom offers a cozy retreat inviting relaxation. 1. The. The. The. The. 2.. The. The. The. The. The. The 3.. The. The room's design combines functionality and comfort, making it a perfect space for rest and unwind.
+ Ours: This bedroom offers a cozy retreat with a comfortable and inviting bed. The soft pillows create a relaxing ambiance, while warm lighting from the lamps provides a cozy glow throughout the room. The wooden furniture adds a touch of warmth and style, making it a perfect space to relax and unwind.
Long-form generation
Input
Baseline: This, this image evokes a sense of nostalgia and the. The, the stack of the,, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the the.
+ Ours: This image evokes a sense of nostalgia and the timeless appeal of classic literature. The presence of classics like “The Soul of” by George Bernard Shaw and “The Paintedcible” by William Goldle suggests a collection of works that have been cherished over the years, invoking a sense of the past continuing in the present.
@misc{mpdpac2026,
title = {Mitigating Mask Prior Drift and Positional Attention Collapse
in Large Diffusion Vision-Language Models},
author = {Hong, Sujung and Yoon, Chanyong and Hwang, Seongjae},
year = {2026},
eprint = {2605.14530},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}