Mitigating Mask Prior Drift and
Positional Attention Collapse
in Large Diffusion VLMs

A training-free, inference-time approach pairing Mask Prior Suppression with Monotonic RoPE Scaling to fix repetitive generation and degraded visual grounding in large diffusion vision–language models.

Sujung Hong¹, Chanyong Yoon¹, Seong Jae Hwang¹

¹ Yonsei University

ICML 2026

Abstract

Large diffusion vision–language models (LDVLMs) enable parallel decoding and bidirectional attention, yet their behavior under long-form generation remains underexplored. We show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, driven by two causes: a mask token prior that pulls hidden representations toward a shared direction over generation steps, and a misalignment between the positional attention bias and the iterative unmasking process that suppresses attention to informative visual tokens.

We propose a training-free approach, Mask Prior Suppression and Monotonic RoPE Scaling, that mitigates mask prior drift and positional attention collapse during decoding. Experiments on multimodal benchmarks and visual grounding tasks show consistent gains over baselines, with robust improvements on long-form description — all from a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

Analysis

Where do LDVLMs fail?

Before fixing anything, we measure two concrete failure modes inside LLaDA-V and LaViDa: mask prior drift and positional attention collapse.

Failure 1

Hidden states drift toward a shared mask prior

Figure 2 — token repetition and mask prior drift

Figure 2. (a) Fewer generation steps lead to lower distinct-n and higher repetition. (b) 3D PCA trajectories: the vocabulary-mean and uncontextualized mask-token representations converge in the final layer. (c) Contextualized mask tokens align with the vocabulary mean far more strongly than random embeddings do.

Across generation steps, the hidden states of contextualized mask tokens are pulled toward the mean vocabulary direction in the final layer.

The fewer the unmasking steps, the stronger the collapse — and the more the model emits structured repetitions instead of diverse text.

This motivates Mask Prior Suppression: remove the prior direction from the final hidden state before the LM head.
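The repetition trend in Figure 2(a) is reported with distinct-n. As a rough illustration, the sketch below computes the usual definition (unique n-grams divided by total n-grams). The whitespace tokenization, the toy strings, and the helper name `distinct_n` are assumptions made for this example, not the evaluation code behind the figure.

```python
def distinct_n(tokens, n=2):
    """Fraction of n-grams that are unique; 1.0 means no n-gram repeats."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n: nothing to measure
    return len(set(ngrams)) / len(ngrams)


# Toy inputs (hypothetical): a collapsed output and a diverse one.
repetitive = "the . the . the . the . the".split()
diverse = "soft pillows create a relaxing ambiance".split()

print(distinct_n(repetitive, n=2))  # low value: only two distinct bigrams
print(distinct_n(diverse, n=2))     # every bigram is unique
```

A lower score means more of the output is made of repeated n-grams, which is the pattern the baseline examples show.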

Failure 2

Attention collapses onto nearby mask tokens

Generation tokens disproportionately attend to nearby mask tokens with little semantic content, while attention to distant visual tokens decays sharply with distance.

A frequency decomposition reveals the cause: high-frequency RoPE dimensions dominate at short range and low-frequency dimensions carry long-range interactions, yet the overall attention to distant visual tokens stays weak.

This motivates Monotonic RoPE Scaling: lift low-frequency components more strongly than high-frequency ones to restore long-range visual attention.
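To make the frequency-wise view concrete, the sketch below splits a RoPE attention logit into one term per frequency pair, using the standard identity that each rotated pair contributes the real part of q_i · conj(k_i) · exp(iΔθ_i) for relative distance Δ. The base of 10000, the interleaved pairing of dimensions, and all function names are assumptions for illustration; the decomposition used for Figure 3(c) may differ in detail.

```python
import cmath


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies; index 0 is the highest frequency."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rotate(vec, position, freqs):
    """Apply RoPE by treating each consecutive pair of dims as a complex number."""
    return [complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * position * f)
            for i, f in enumerate(freqs)]


def per_frequency_logits(q, k, q_pos, k_pos, freqs):
    """Contribution of each frequency pair to the unscaled logit q_rot . k_rot."""
    q_rot = rotate(q, q_pos, freqs)
    k_rot = rotate(k, k_pos, freqs)
    return [(a * b.conjugate()).real for a, b in zip(q_rot, k_rot)]


freqs = rope_frequencies(head_dim=8)
q = k = [1.0] * 8  # toy query/key so each pair contributes 2*cos(distance * freq)
near = per_frequency_logits(q, k, q_pos=1, k_pos=0, freqs=freqs)
far = per_frequency_logits(q, k, q_pos=200, k_pos=0, freqs=freqs)
shifted = per_frequency_logits(q, k, q_pos=205, k_pos=5, freqs=freqs)

print(near)  # one term per frequency pair at distance 1
print(far)   # at distance 200 the lowest-frequency term has barely changed
```

In this toy setting the lowest-frequency term varies slowly with distance while the high-frequency terms oscillate, and the terms depend only on the relative distance between the two positions.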

Figure 3 — positional attention collapse

Figure 3. (a) Attention vs. relative distance: mask tokens are over-attended, while attention to visual tokens decays quickly. (b) Per-step attention summed over visual vs. mask tokens: mask tokens absorb attention mass comparable to that of visual tokens throughout decoding. (c) Frequency-wise decomposition: high-frequency components dominate at short range and low-frequency components at long range, but overall long-range attention is still weak.

Method

Two inference-time fixes

Both interventions are plug-in: no retraining, no fine-tuning, no extra parameters.

Figure 1. End-to-end view of Mask Prior Suppression and Monotonic RoPE Scaling.

Mask Prior Suppression

Mask token hidden states drift toward a shared prior — the vocabulary-mean direction — driving repetitive generation. We forward the mean embedding through the network, run PCA on the layer-wise means to obtain a low-rank prior subspace, and at the final layer project each token's deviation onto this subspace and suppress it adaptively by cosine similarity. Only ~0.07% of the final-layer dimensions are touched, preserving semantics.

$\tilde{h}_L^{e_j} = h_L^{e_j} \;-\; \lambda\,\cos(\theta_j)\;\mathbf{U}\mathbf{U}^{\!\top}\!\bigl(h_L^{e_j} - \mu\bigr)$

Mean-embedding PCA · Final layer only · ~0.07% of dims
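The final-layer update above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the rows of `prior_basis` play the role of the orthonormal columns of U and are taken as already computed by the PCA step, `mu` stands in for the prior mean μ, and cos(θ_j) is assumed here to be the cosine similarity between the token's hidden state and `mu`, which the text does not spell out. All names and numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def suppress_mask_prior(h, mu, prior_basis, lam=1.0):
    """Return h - lam * cos(theta) * U U^T (h - mu) for one token.

    prior_basis: list of orthonormal vectors spanning the prior subspace
                 (assumed precomputed; the PCA itself is not shown here).
    cos(theta):  ASSUMED to be cosine(h, mu); the source leaves this implicit.
    """
    deviation = [hi - mi for hi, mi in zip(h, mu)]
    # U U^T (h - mu): project the deviation onto the prior subspace.
    projected = [0.0] * len(h)
    for u in prior_basis:
        coeff = dot(u, deviation)
        projected = [p + coeff * ui for p, ui in zip(projected, u)]
    gate = lam * cosine(h, mu)  # stronger suppression when h aligns with the prior
    return [hi - gate * pi for hi, pi in zip(h, projected)]


# Toy 3-dimensional example with a rank-1 prior subspace along the first axis.
h = [3.0, 1.0, 0.0]
mu = [1.0, 0.0, 0.0]
prior_basis = [[1.0, 0.0, 0.0]]
print(suppress_mask_prior(h, mu, prior_basis, lam=1.0))
```

Only the component inside the prior subspace changes; everything orthogonal to it passes through untouched, which matches the claim that a very small share of the final-layer dimensions is modified.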

Monotonic RoPE Scaling

High-frequency RoPE components concentrate attention on nearby mask tokens, while distant visual tokens — carried by low-frequency components — are under-attended. We apply a sigmoid-gated, frequency-aware scaling that boosts lower-frequency components more strongly than higher ones, lifting long-range attention to visual tokens while preserving RoPE's relative-distance structure.

$s_i = 1 + \beta\,\sigma\!\bigl(\eta(\tau_i - \tau_0)\bigr),\quad \tau_i = \tfrac{i}{d/2 - 1}$

Sigmoid gate · Low-freq emphasis · Preserves relative position
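As a small worked example of the scaling rule, the sketch below evaluates s_i for every RoPE frequency index. It assumes the standard RoPE ordering, in which a larger index i means a lower frequency, and a positive η so that the gate increases with i. The values of β, η, and τ₀ are placeholders chosen for illustration, and the sketch stops at computing the factors because the text here does not state where in the attention computation they are applied.

```python
import math


def monotonic_rope_scales(head_dim, beta, eta, tau0):
    """s_i = 1 + beta * sigmoid(eta * (tau_i - tau0)), tau_i = i / (d/2 - 1)."""
    num_freqs = head_dim // 2
    if num_freqs < 2:
        raise ValueError("need at least two frequency pairs to normalize tau")
    scales = []
    for i in range(num_freqs):
        tau = i / (num_freqs - 1)  # 0 for the highest frequency, 1 for the lowest
        gate = 1.0 / (1.0 + math.exp(-eta * (tau - tau0)))
        scales.append(1.0 + beta * gate)
    return scales


# Placeholder hyperparameters (hypothetical, not the paper's settings).
scales = monotonic_rope_scales(head_dim=10, beta=0.5, eta=10.0, tau0=0.5)
print(scales)  # increases from about 1 toward 1 + beta as the frequency drops
```

Every factor lies between 1 and 1 + β, and the sequence never decreases with i, which is the monotonic property the method's name refers to.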

Results

Baseline vs. Ours

Gains on nearly all of the nine benchmarks for both LLaDA-V and LaViDa (LLaDA-V is unchanged on GQA), with the largest improvements on visual grounding and long-form generation.

Main results across nine benchmarks

Model	General			Visual Grounding			Long-form Generation
	MME	MMBench	MMMU	RefCOCOg	Ferret	GQA	LLaVA-Bench	DetailCaps	MIA
LLaDA-V	1998	82.9	48.6	64.8	60.4	61.6	61.3	59.8	66.1
+ Ours	2003	83.3	49.3	65.0	62.9	61.6	64.1	63.6	67.0
LaViDa	1682	71.7	43.2	36.9	25.9	59.4	39.5	8.3	49.4
+ Ours	1705	72.0	43.7	44.0	35.7	60.2	46.5	56.1	57.3

Highest gains on visual grounding (LaViDa: RefCOCOg +7.1, Ferret +9.8) and long-form generation (LaViDa: DetailCaps +47.8, LLaVA-Bench +7.0). All hyperparameters fixed across tasks; no retraining.
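The quoted gains follow directly from the LaViDa rows of the table. The short check below recomputes them; the scores are copied from the table, while the dictionary layout and variable names are only for illustration.

```python
benchmarks = ["MME", "MMBench", "MMMU", "RefCOCOg", "Ferret",
              "GQA", "LLaVA-Bench", "DetailCaps", "MIA"]
lavida_baseline = [1682, 71.7, 43.2, 36.9, 25.9, 59.4, 39.5, 8.3, 49.4]
lavida_ours = [1705, 72.0, 43.7, 44.0, 35.7, 60.2, 46.5, 56.1, 57.3]

# Absolute gain per benchmark, rounded to the table's one-decimal precision.
gains = {name: round(ours - base, 1)
         for name, base, ours in zip(benchmarks, lavida_baseline, lavida_ours)}

for name in ["RefCOCOg", "Ferret", "LLaVA-Bench", "DetailCaps"]:
    print(f"{name}: +{gains[name]}")
```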

Ablation: each component matters

Either component alone helps on most benchmarks, but the full model (MPS + MRS) gives the most stable and balanced gains across visual grounding and long-form generation. MPS: Mask Prior Suppression · MRS: Monotonic RoPE Scaling.

Model	MPS	MRS	Visual Grounding		Long-form Generation
			RefCOCOg	Ferret	LLaVA-Bench	DetailCaps
LLaDA-V	—	—	64.8	60.4	61.3	59.8
	✓	—	64.8	60.2	61.7	60.0
	—	✓	64.8	60.6	63.9	60.0
+ Ours	✓	✓	65.0	62.9	64.1	63.6
LaViDa	—	—	36.9	25.9	39.5	8.3
	✓	—	44.0	35.4	42.3	56.1
	—	✓	39.5	29.8	41.0	8.3
+ Ours	✓	✓	44.0	35.7	46.5	56.1

On LaViDa, MPS alone accounts for the entire DetailCaps jump (8.3 → 56.1), while MRS alone leaves it unchanged at 8.3, which points to mask prior drift as the driver of long-form collapse. MRS independently lifts LLaVA-Bench on LLaDA-V (61.3 → 63.9).

Qualitative examples

LLaDA-V — “Provide a short description for this region.”

Visual grounding · RefCOCOg-style

Input · red box = target

Baseline

The woman in the red shirt.

Ours

A woman in a green shirt.

LaViDa — “Provide a short description for this region.”

Visual grounding · RefCOCOg-style

Input · red box = target

Baseline

The net of a tennis court.

Ours

A woman holding a tennis racket.

LLaDA-V — “Write a short real-estate ad for this room. Mention three selling points.”

Long-form generation · MIA-style

Input

Baseline

This this room bedroom offers a cozy retreat inviting relaxation. 1. The. The. The. The. 2.. The. The. The. The. The. The 3.. The. The room's design combines functionality and comfort, making it a perfect space for rest and unwind.

Ours

This bedroom offers a cozy retreat with a comfortable and inviting bed. The soft pillows create a relaxing ambiance, while warm lighting from the lamps provides a cozy glow throughout the room. The wooden furniture adds a touch of warmth and style, making it a perfect space to relax and unwind.

LaViDa — “Express the feelings that might be elicited by this image using a first-person perspective, specify any author's name visible, while implying a sense of nostalgia.”

Long-form generation

Input

Baseline

This, this image evokes a sense of nostalgia and the. The, the stack of the,, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the the.

Ours

This image evokes a sense of nostalgia and the timeless appeal of classic literature. The presence of classics like “The Soul of” by George Bernard Shaw and “The Paintedcible” by William Goldle suggests a collection of works that have been cherished over the years, invoking a sense of the past continuing in the present.

Cite

BibTeX

@misc{mpdpac2026,
  title         = {Mitigating Mask Prior Drift and Positional Attention Collapse
                   in Large Diffusion Vision-Language Models},
  author        = {Hong, Sujung and Yoon, Chanyong and Hwang, Seongjae},
  year          = {2026},
  eprint        = {2605.14530},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
