Dense-Iso-ViT
Constant-Resolution Hierarchical Vision Transformer for ×4 Image Super-Resolution

💡 For best results, upload a low-resolution image (256×256 or smaller). The model upscales it ×4.

Examples: showing V4 strengths

"Isotropic constant-resolution hierarchical ViT with inter-stage dense feature aggregation, eliminating spatial bottlenecks while preserving coordinate integrity throughout all processing stages."

Key architectural decisions

Isotropic token grid: constant 16×16 spatial resolution across all 4 transformer stages. Zero patch merging, zero token downsampling. Every token maps to the same 4×4 pixel region from input to output.
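
The fixed token-to-pixel mapping can be sketched in a few lines of plain Python. This is only an illustration: it assumes a 64×64 low-resolution input (implied by 16×16 tokens of 4×4 pixels each) and row-major token ordering, neither of which is stated explicitly above. The function names are hypothetical.

```python
GRID = 16   # tokens per side (from the text)
PATCH = 4   # low-resolution pixels per token side (from the text)
SCALE = 4   # upscaling factor (from the text)


def token_to_lr_box(index, grid=GRID, patch=PATCH):
    """Return (top, left, bottom, right) of the LR pixel region of a token.

    Assumes row-major token order; bottom/right are exclusive.
    """
    if not 0 <= index < grid * grid:
        raise IndexError(f"token index {index} outside {grid}x{grid} grid")
    row, col = divmod(index, grid)
    return (row * patch, col * patch, (row + 1) * patch, (col + 1) * patch)


def lr_box_to_sr_box(box, scale=SCALE):
    """Scale an LR pixel box to the corresponding region of the x4 output."""
    return tuple(v * scale for v in box)


print(token_to_lr_box(0))                      # (0, 0, 4, 4)
print(token_to_lr_box(255))                    # (60, 60, 64, 64)
print(lr_box_to_sr_box(token_to_lr_box(17)))   # (16, 16, 32, 32)
```

Because the grid never changes, the same index-to-region mapping holds after every stage, which is what the text means by preserving coordinate integrity.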

Hierarchical embed dims [192 → 256 → 288 → 384]: representational capacity scales with feature complexity. Early stages learn local edges and textures (192-dim is sufficient). Deep stages reason about global scene semantics (384-dim is necessary).

Inter-stage macro concatenation: the outputs of all 4 stages are concatenated along the channel axis and fed directly to the reconstruction head, cat([h1, h2, h3, h4]) → [B, 256, 1120], where 1120 = 192 + 256 + 288 + 384. The head receives low-level edge maps (h1) and high-level semantic context (h4) simultaneously.
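
A minimal PyTorch sketch of the aggregation, to make the shapes concrete. The `ToyStage` class is a hypothetical stand-in (a single per-token linear projection) for the real transformer stages, which are not specified here; only the widths, the token count, and the concatenation come from the text.

```python
import torch
import torch.nn as nn

EMBED_DIMS = (192, 256, 288, 384)  # stage widths (from the text)
NUM_TOKENS = 16 * 16               # constant 16x16 token grid (from the text)


class ToyStage(nn.Module):
    """Hypothetical placeholder for a transformer stage: projects tokens to the stage width."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, tokens):
        return self.proj(tokens)


class DenseAggregator(nn.Module):
    """Runs the stages in sequence and concatenates every stage output channel-wise."""

    def __init__(self, dims=EMBED_DIMS):
        super().__init__()
        in_dims = (dims[0],) + tuple(dims[:-1])
        self.stages = nn.ModuleList([ToyStage(i, o) for i, o in zip(in_dims, dims)])

    def forward(self, tokens):
        outputs = []
        for stage in self.stages:
            tokens = stage(tokens)
            outputs.append(tokens)           # h1, h2, h3, h4
        return torch.cat(outputs, dim=-1)    # token count unchanged, channels summed


tokens = torch.randn(2, NUM_TOKENS, EMBED_DIMS[0])
fused = DenseAggregator()(tokens)
print(fused.shape)  # torch.Size([2, 256, 1120])
```

Concatenation along the last axis only works because every stage keeps the same 256 tokens; with patch merging the stage outputs would need resampling before they could be stacked.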

GDFN feed-forward: replaces standard MLPs with Gated Depthwise Feed-Forward Networks. Each token sees its 3×3 spatial neighborhood during the feed-forward step, so local spatial context is injected in every transformer block.
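
One plausible form of such a block, sketched in PyTorch. It follows the gated depthwise-convolution feed-forward design popularized by Restormer (1×1 expansion, 3×3 depthwise convolution, GELU gating, 1×1 projection). Whether this model uses exactly that layout is an assumption, and the class name, `hidden_dim` value, and the token-to-grid reshaping are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedDepthwiseFFN(nn.Module):
    """Sketch of a gated depthwise feed-forward block operating on [B, N, C] tokens."""

    def __init__(self, dim, hidden_dim, grid=16):
        super().__init__()
        self.grid = grid
        # 1x1 conv expands channels; two halves become the gate and the value.
        self.project_in = nn.Conv2d(dim, hidden_dim * 2, kernel_size=1)
        # Depthwise 3x3 conv: each channel mixes only its 3x3 spatial neighborhood.
        self.dwconv = nn.Conv2d(hidden_dim * 2, hidden_dim * 2, kernel_size=3,
                                padding=1, groups=hidden_dim * 2)
        self.project_out = nn.Conv2d(hidden_dim, dim, kernel_size=1)

    def forward(self, tokens):
        b, n, c = tokens.shape
        # Tokens -> feature map so the depthwise conv can see spatial neighbors.
        x = tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        gate, value = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        x = self.project_out(F.gelu(gate) * value)
        # Feature map -> tokens, same shape as the input.
        return x.flatten(2).transpose(1, 2)


ffn = GatedDepthwiseFFN(dim=192, hidden_dim=384)
out = ffn(torch.randn(2, 256, 192))
print(out.shape)  # torch.Size([2, 256, 192])
```

In this sketch the only spatial mixing is the single 3×3 depthwise convolution, so changing one token can affect at most its 3×3 neighborhood in the block's output.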

Bilinear skip connection: output = F.interpolate(lr, 256×256) + vit_residual. The model learns only a residual correction on top of the bilinear upsample, not the full reconstruction from scratch.
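
Written out as runnable PyTorch, the skip might look like the sketch below. The helper name `add_bilinear_skip`, the 64×64 input size, and `align_corners=False` are assumptions made for illustration; the text only fixes the 256×256 target size and the addition.

```python
import torch
import torch.nn.functional as F


def add_bilinear_skip(lr, vit_residual):
    """Upsample the LR image bilinearly to the residual's size and add the residual."""
    size = tuple(vit_residual.shape[-2:])
    base = F.interpolate(lr, size=size, mode="bilinear", align_corners=False)
    return base + vit_residual


lr = torch.rand(1, 3, 64, 64)              # assumed LR input size
residual = torch.zeros(1, 3, 256, 256)     # stand-in for the ViT branch output
sr = add_bilinear_skip(lr, residual)
print(sr.shape)  # torch.Size([1, 3, 256, 256])
```

With a zero residual the output is exactly the bilinear upsample, so the transformer branch only has to model what interpolation gets wrong.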

Results

Benchmark           Avg PSNR (dB)   Avg SSIM
DIV2K validation    25.20           0.8298
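
For readers unfamiliar with the metric, PSNR is a logarithmic function of mean squared error. The sketch below shows only the standard definition; the evaluation protocol behind the number above (RGB versus luminance channel, border cropping, value range) is not stated, so the `max_val=1.0` default is an assumption.

```python
import math


def psnr_from_mse(mse, max_val=1.0):
    """PSNR in dB for a given mean squared error and peak signal value."""
    if mse <= 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_from_psnr(psnr_db, max_val=1.0):
    """Inverse of psnr_from_mse: the MSE that yields the given PSNR."""
    return max_val ** 2 / (10.0 ** (psnr_db / 10.0))


print(round(psnr_from_mse(0.01), 2))    # 20.0
print(round(mse_from_psnr(25.20), 5))   # 0.00302
```

Under that assumed [0, 1] range, 25.20 dB corresponds to a mean squared error of roughly 0.003 per pixel value.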