Dense-Iso-ViT SR

Dense-Iso-ViT

Constant-Resolution Hierarchical Vision Transformer for ×4 Image Super-Resolution

Upload any image

💡 For best results, upload a low-resolution image (256×256 or smaller). The model upscales it ×4.

LR Input (bilinear upscaled for display)

SR Output — Dense-Iso-ViT

Ground Truth (original crop)

Examples — showing V4 strengths

"Isotropic constant-resolution hierarchical ViT with inter-stage dense feature aggregation — eliminating spatial bottlenecks while preserving coordinate integrity throughout all processing stages."

Key architectural decisions

Isotropic token grid — constant 16×16 spatial resolution across all 4 transformer stages. Zero patch merging, zero token downsampling. Every token maps to the same 4×4 pixel region from input to output.
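The token-to-pixel mapping described above can be sketched with plain arithmetic. This is only an illustration: the 64×64 LR input size is inferred from the 16×16 grid and the 4×4 region per token, and the row-major token ordering and the helper name `token_region` are assumptions, not details taken from the model code.

```python
GRID = 16    # tokens per side (from the text)
PATCH = 4    # LR pixels covered by one token along each axis (from the text)
SCALE = 4    # upscaling factor (from the text)

def token_region(index, grid=GRID, patch=PATCH):
    """Return (top, left, bottom, right) of the LR pixel block for one token.

    Assumes tokens are numbered in row-major order; bottom/right are exclusive.
    """
    if not 0 <= index < grid * grid:
        raise ValueError("token index out of range")
    row, col = divmod(index, grid)
    return (row * patch, col * patch, (row + 1) * patch, (col + 1) * patch)

lr_side = GRID * PATCH       # implied LR input side length
num_tokens = GRID * GRID     # sequence length seen by every stage
sr_side = lr_side * SCALE    # output side length after x4 upscaling
```

Because no stage merges or drops tokens, `token_region` gives the same answer for a token at stage 1 and at stage 4.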

Hierarchical embed dims [192 → 256 → 288 → 384] — representational capacity scales with feature complexity. Early stages learn local edges and textures (192-dim is sufficient). Deep stages reason about global scene semantics (384-dim is necessary).

Inter-stage macro concatenation — outputs from all 4 stages concatenated directly to the reconstruction head: cat([h1, h2, h3, h4]) → [B, 256, 1120]. The head receives low-level edge maps (h1) and high-level semantic context (h4) simultaneously.
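To make the 1120-wide concatenated feature concrete, the following sketch computes which channel range each stage output would occupy. It assumes the stages are concatenated along the channel axis in the order h1 to h4, as the `cat([h1, h2, h3, h4])` expression suggests; `concat_layout` is a hypothetical helper used for illustration only.

```python
from itertools import accumulate

EMBED_DIMS = [192, 256, 288, 384]  # per-stage widths (from the text)
NUM_TOKENS = 16 * 16               # constant token grid (from the text)

def concat_layout(dims):
    """Map each stage name to its (start, stop) channel range after concatenation."""
    stops = list(accumulate(dims))
    starts = [0] + stops[:-1]
    return {f"h{i + 1}": (a, b) for i, (a, b) in enumerate(zip(starts, stops))}

layout = concat_layout(EMBED_DIMS)

# Shape of the tensor handed to the reconstruction head, with "B" as the batch size.
concat_shape = ("B", NUM_TOKENS, sum(EMBED_DIMS))
```

Under these assumptions the low-level features from h1 sit in channels 0 to 191 and the widest, deepest features from h4 sit in channels 736 to 1119.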

GDFN feed-forward — replaces standard MLPs with Gated Depthwise Feed-Forward Networks. Each token sees its 3×3 spatial neighborhood during the MLP step, so local spatial context is injected in every transformer block.
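A gated depthwise feed-forward block of this kind might look like the PyTorch sketch below. It is modeled on the gated depthwise-convolution feed-forward design popularized by Restormer; whether Dense-Iso-ViT uses the same layer layout, the expansion factor of 2.0, or GELU gating is an assumption, and the class name `GDFN` and its parameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDFN(nn.Module):
    """Illustrative gated depthwise feed-forward block operating on a token grid."""

    def __init__(self, dim, grid=16, expansion=2.0):
        super().__init__()
        hidden = int(dim * expansion)  # assumed expansion factor
        self.grid = grid
        # 1x1 conv produces both the gate branch and the value branch.
        self.project_in = nn.Conv2d(dim, hidden * 2, kernel_size=1)
        # Depthwise 3x3 conv: each channel mixes only its 3x3 spatial neighborhood.
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)
        self.project_out = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, tokens):
        b, n, c = tokens.shape  # [B, grid*grid, dim]
        x = tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        x = self.dwconv(self.project_in(x))
        gate, value = x.chunk(2, dim=1)
        x = self.project_out(F.gelu(gate) * value)  # gated element-wise product
        return x.flatten(2).transpose(1, 2)         # back to [B, N, dim]

block = GDFN(dim=192)
out = block(torch.randn(2, 256, 192))  # same shape as the input tokens
```

The only spatial mixing in this block is the 3×3 depthwise convolution, so changing one input token can affect at most its 3×3 neighborhood in the output.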

Bilinear skip connection — output = F.interpolate(lr, 256×256) + vit_residual. Model learns residual correction only, not full reconstruction from scratch.
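The skip connection can be written out as a short PyTorch sketch. The function name `add_bilinear_skip`, the `align_corners=False` setting, and the 64×64 example input are assumptions made for illustration; the text only states that a bilinear upsample of the LR image is added to the transformer's residual.

```python
import torch
import torch.nn.functional as F

def add_bilinear_skip(lr, vit_residual):
    """Add the predicted residual to a bilinear upsample of the LR input.

    lr:           [B, C, h, w] low-resolution image
    vit_residual: [B, C, H, W] correction predicted by the network
    """
    base = F.interpolate(lr, size=vit_residual.shape[-2:],
                         mode="bilinear", align_corners=False)
    return base + vit_residual

lr = torch.rand(1, 3, 64, 64)            # assumed LR size for a 256x256 output
residual = torch.zeros(1, 3, 256, 256)   # a zero residual reproduces plain bilinear
sr = add_bilinear_skip(lr, residual)
```

With a zero residual the output is exactly the bilinear upsample, which shows what the network starts from: it only has to learn the detail that interpolation misses.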

Results

Benchmark	Avg PSNR	Avg SSIM
DIV2K validation	25.20 dB	0.8298
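For readers unfamiliar with the metric, PSNR is defined as 10·log10(MAX² / MSE). The sketch below implements that standard definition on flat lists of pixel values. It assumes 8-bit intensities (MAX = 255); the page does not say whether the reported numbers were computed on RGB or on the luminance channel, or whether borders were cropped, so this is not necessarily the exact evaluation protocol used.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

An average PSNR such as the one in the table is typically the mean of per-image values returned by a function like this.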
