Mitigating Mask Prior Drift and
Positional Attention Collapse
in Large Diffusion VLMs

A training-free, inference-time approach pairing Mask Prior Suppression with Monotonic RoPE Scaling to fix repetitive generation and degraded visual grounding in large diffusion vision–language models.

Sujung Hong¹, Chanyong Yoon¹, Seong Jae Hwang¹
¹ Yonsei University
ICML 2026

Abstract

Large diffusion vision–language models (LDVLMs) enable parallel decoding and bidirectional attention, yet their behavior under long-form generation remains underexplored. We show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, driven by two causes: a mask token prior that pulls hidden representations toward a shared direction over generation steps, and a misalignment between the positional attention bias and the iterative unmasking process that suppresses attention to informative visual tokens.

We propose a training-free approach, Mask Prior Suppression and Monotonic RoPE Scaling, that mitigates mask prior drift and positional attention collapse during decoding. Experiments on multimodal benchmarks and visual grounding tasks show consistent gains over baselines, with robust improvements on long-form description — all from a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

Analysis

Where do LDVLMs fail?

Before fixing anything, we measure two concrete failure modes inside LLaDA-V and LaViDa: mask prior drift and positional attention collapse.

Failure 1

Hidden states drift toward a shared mask prior

Figure 2 — token repetition and mask prior drift

Figure 2. (a) Fewer generation steps → lower distinct-n and higher repetition. (b) 3D PCA trajectories: vocab-mean and uncontextualized mask token converge in the final layer. (c) Contextualized mask tokens align with the vocab mean far more strongly than random embeddings.

Across generation steps, the hidden states of contextualized mask tokens are pulled toward the mean vocabulary direction in the final layer.

The fewer the unmasking steps, the stronger the collapse — and the more the model emits structured repetitions instead of diverse text.
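
The distinct-n score referenced in Figure 2(a) is commonly defined as the ratio of unique n-grams to total n-grams in a generated text. The snippet below is a minimal sketch of that standard definition, assuming whitespace tokenization; the tokenizer and the values of n used in the paper are not stated on this page, and the example sentences are made up for illustration.

```python
def distinct_n(tokens, n):
    """Ratio of unique n-grams to total n-grams (0.0 if the text is shorter than n)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Hypothetical outputs: one varied sentence and one repetitive loop.
diverse = "a dog runs across the wet grass".split()
looping = "the dog the dog the dog the dog".split()

print(distinct_n(diverse, 2))  # every bigram is unique
print(distinct_n(looping, 2))  # only two distinct bigrams out of seven
```

A lower score means more repeated n-grams, which is the symptom reported when the number of unmasking steps is reduced.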

This motivates Mask Prior Suppression: remove the prior direction from the final hidden state before the LM head.

Failure 2

Attention collapses onto nearby mask tokens

Generation tokens disproportionately attend to nearby mask tokens with little semantic content, while attention to distant visual tokens decays sharply with distance.

A frequency decomposition reveals the cause: high-frequency RoPE dimensions dominate at short range, low-frequency ones carry long-range interactions — but the overall attention to far visual tokens stays weak.
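
To make the frequency argument concrete, the sketch below computes the standard RoPE angular frequencies and the contribution of each 2-D rotary pair to an attention logit at a given relative distance. It is an illustration under simplifying assumptions, not the analysis code behind Figure 3: the head dimension (128) and base (10000) are common defaults chosen here, and query and key content are assumed identical and unit-norm within every pair, so each pair contributes cos(distance · θ_i).

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies theta_i = base^(-2i/d), i = 0..d/2-1."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

def pair_contributions(distance, theta):
    """Per-pair logit contribution for identical unit-norm query/key content."""
    return np.cos(distance * theta)

theta = rope_frequencies(head_dim=128)   # assumed head size
wavelengths = 2 * np.pi / theta          # tokens per full rotation of each pair

for distance in (1, 10, 100, 1000):
    c = pair_contributions(distance, theta)
    high = float(c[:32].mean())   # high-frequency half (small i)
    low = float(c[32:].mean())    # low-frequency half (large i)
    print(f"distance={distance:5d}  high-freq mean={high:+.3f}  low-freq mean={low:+.3f}")
```

High-frequency pairs complete a rotation within a few tokens, so their contributions oscillate and largely cancel at long range, while the lowest-frequency pairs stay close to their zero-distance value over hundreds of tokens.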

This motivates Monotonic RoPE Scaling: lift low-frequency components more strongly than high-frequency ones to restore long-range visual attention.

Figure 3 — positional attention collapse

Figure 3. (a) Attention vs. relative distance — mask tokens are over-attended, visual tokens decay quickly. (b) Per-step attention summed over visual vs. mask tokens — mask tokens absorb comparable mass throughout decoding. (c) Frequency-wise decomposition: high-freq dominates short range, low-freq long range; overall long-range attention is still weak.

Method

Two inference-time fixes

Both interventions are plug-in: no retraining, no fine-tuning, no extra parameters.

Figure 1. End-to-end view of Mask Prior Suppression and Monotonic RoPE Scaling.

Mask Prior Suppression

Mask token hidden states drift toward a shared prior — the vocabulary-mean direction — driving repetitive generation. We forward the mean embedding through the network, run PCA on the layer-wise means to obtain a low-rank prior subspace, and at the final layer project each token's deviation onto this subspace and suppress it adaptively by cosine similarity. Only ~0.07% of the final-layer dimensions are touched, preserving semantics.

$\tilde{h}_L^{e_j} = h_L^{e_j} \;-\; \lambda\,\cos(\theta_j)\;\mathbf{U}\mathbf{U}^{\!\top}\!\bigl(h_L^{e_j} - \mu\bigr)$
Mean-embedding PCA · Final layer only · ~0.07% of dims
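
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the layer-wise states of the forwarded mean embedding are replaced by random data, `prior_subspace` and `suppress_mask_prior` are hypothetical helper names, μ is taken to be the mean of the layer-wise states, and θ_j is assumed to be the angle between token j's final hidden state and the final-layer prior vector (the page does not spell out either choice).

```python
import numpy as np

def prior_subspace(layer_means, rank):
    """PCA over layer-wise states of the forwarded mean embedding.

    layer_means: (num_layers, d). Returns mu (d,) and an orthonormal basis U (d, rank).
    """
    mu = layer_means.mean(axis=0)
    # Rows of vt are the principal directions of the centered layer-wise means.
    _, _, vt = np.linalg.svd(layer_means - mu, full_matrices=False)
    return mu, vt[:rank].T

def suppress_mask_prior(h, mu, U, prior_dir, lam):
    """Apply h - lam * cos(theta_j) * U U^T (h - mu) to each row of h (n, d)."""
    # Assumption: theta_j is the angle between token j's state and prior_dir.
    norms = np.linalg.norm(h, axis=1) * np.linalg.norm(prior_dir) + 1e-8
    cos = (h @ prior_dir) / norms
    in_subspace = (h - mu) @ U @ U.T
    return h - lam * cos[:, None] * in_subspace, cos

rng = np.random.default_rng(0)
layer_means = rng.normal(size=(8, 16))   # stand-in: 8 layers, hidden size 16
h_final = rng.normal(size=(5, 16))       # stand-in final-layer states for 5 tokens

mu, U = prior_subspace(layer_means, rank=2)
h_tilde, cos = suppress_mask_prior(h_final, mu, U, prior_dir=layer_means[-1], lam=1.0)
```

Only the component of each deviation that lies inside the prior subspace is rescaled, by a factor of 1 − λ cos(θ_j); everything orthogonal to the subspace passes through unchanged, which is why the edit touches so little of the representation.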

Monotonic RoPE Scaling

High-frequency RoPE components concentrate attention on nearby mask tokens, while distant visual tokens — carried by low-frequency components — are under-attended. We apply a sigmoid-gated, frequency-aware scaling that boosts lower-frequency components more strongly than higher ones, lifting long-range attention to visual tokens while preserving RoPE's relative-distance structure.

$s_i = 1 + \beta\,\sigma\!\bigl(\eta(\tau_i - \tau_0)\bigr),\quad \tau_i = \tfrac{i}{d/2 - 1}$
Sigmoid gate · Low-freq emphasis · Preserves relative pos.
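
The scaling factors in the formula above are straightforward to compute. The sketch below evaluates s_i for every rotary pair, assuming the usual RoPE indexing in which a larger i corresponds to a lower frequency; the values of β, η, and τ₀ are placeholders picked for illustration, not the paper's settings, and how the factors are applied inside attention is left out because this page does not specify it.

```python
import numpy as np

def monotonic_rope_scales(head_dim, beta, eta, tau0):
    """s_i = 1 + beta * sigmoid(eta * (tau_i - tau0)), with tau_i = i / (d/2 - 1)."""
    half = head_dim // 2
    tau = np.arange(half) / (half - 1)   # normalized pair index in [0, 1]
    gate = 1.0 / (1.0 + np.exp(-eta * (tau - tau0)))
    return 1.0 + beta * gate

# Placeholder hyperparameters, chosen only to show the shape of the curve.
beta, eta, tau0 = 0.5, 10.0, 0.5
scales = monotonic_rope_scales(head_dim=128, beta=beta, eta=eta, tau0=tau0)

print(scales[0], scales[-1])  # near 1 for the highest frequency, near 1 + beta for the lowest
```

For positive β and η the factors increase monotonically with i and stay between 1 and 1 + β, so high-frequency pairs are left almost untouched while low-frequency pairs receive the largest boost.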

Results

Baseline vs. Ours

Consistent gains across nine benchmarks on both LLaDA-V and LaViDa, with the largest improvements on visual grounding and long-form generation.

Main results across nine benchmarks

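Column groups: General (MME, MMBench, MMMU) · Visual Grounding (RefCOCOg, Ferret, GQA) · Long-form Generation (LLaVA-Bench, DetailCaps, MIA)

| Model | MME | MMBench | MMMU | RefCOCOg | Ferret | GQA | LLaVA-Bench | DetailCaps | MIA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaDA-V | 1998 | 82.9 | 48.6 | 64.8 | 60.4 | 61.6 | 61.3 | 59.8 | 66.1 |
| + Ours | 2003 | 83.3 | 49.3 | 65.0 | 62.9 | 61.6 | 64.1 | 63.6 | 67.0 |
| LaViDa | 1682 | 71.7 | 43.2 | 36.9 | 25.9 | 59.4 | 39.5 | 8.3 | 49.4 |
| + Ours | 1705 | 72.0 | 43.7 | 44.0 | 35.7 | 60.2 | 46.5 | 56.1 | 57.3 |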

Highest gains on visual grounding (LaViDa: RefCOCOg +7.1, Ferret +9.8) and long-form generation (LaViDa: DetailCaps +47.8, LLaVA-Bench +7.0). All hyperparameters fixed across tasks; no retraining.
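
The quoted gains follow directly from the LaViDa rows of the main results table. A small check, using only the scores reported above:

```python
# LaViDa scores copied from the main results table.
baseline = {"RefCOCOg": 36.9, "Ferret": 25.9, "LLaVA-Bench": 39.5, "DetailCaps": 8.3}
ours = {"RefCOCOg": 44.0, "Ferret": 35.7, "LLaVA-Bench": 46.5, "DetailCaps": 56.1}

# Absolute improvement per benchmark, rounded to one decimal like the table.
gains = {name: round(ours[name] - baseline[name], 1) for name in baseline}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```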

Ablation: each component matters

Either component alone helps, but the full model (MPS + MRS) gives the most stable and balanced gains across visual grounding and long-form generation. MPS: Mask Prior Suppression · MRS: Monotonic RoPE Scaling.

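Column groups: Visual Grounding (RefCOCOg, Ferret) · Long-form Generation (LLaVA-Bench, DetailCaps)

| Model | MPS | MRS | RefCOCOg | Ferret | LLaVA-Bench | DetailCaps |
| --- | --- | --- | --- | --- | --- | --- |
| LLaDA-V | | | 64.8 | 60.4 | 61.3 | 59.8 |
| | ✓ | | 64.8 | 60.2 | 61.7 | 60.0 |
| | | ✓ | 64.8 | 60.6 | 63.9 | 60.0 |
| + Ours | ✓ | ✓ | 65.0 | 62.9 | 64.1 | 63.6 |
| LaViDa | | | 36.9 | 25.9 | 39.5 | 8.3 |
| | ✓ | | 44.0 | 35.4 | 42.3 | 56.1 |
| | | ✓ | 39.5 | 29.8 | 41.0 | 8.3 |
| + Ours | ✓ | ✓ | 44.0 | 35.7 | 46.5 | 56.1 |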

On LaViDa, MPS alone unlocks the largest jump on DetailCaps (8.3 → 56.1) — confirming that mask prior drift drives long-form collapse. MRS independently lifts LLaVA-Bench on LLaDA-V (61.3 → 63.9).

Qualitative examples

Cite

BibTeX

@misc{mpdpac2026,
  title         = {Mitigating Mask Prior Drift and Positional Attention Collapse
                   in Large Diffusion Vision-Language Models},
  author        = {Hong, Sujung and Yoon, Chanyong and Hwang, Seong Jae},
  year          = {2026},
  eprint        = {2605.14530},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}