💡 For best results, upload a low-resolution image (256×256 or smaller). The model upscales it ×4.
"Isotropic constant-resolution hierarchical ViT with inter-stage dense feature aggregation, eliminating spatial bottlenecks while preserving coordinate integrity throughout all processing stages."
Key architectural decisions
Isotropic token grid: constant 16×16 spatial resolution across all 4 transformer stages. Zero patch merging, zero token downsampling. Every token maps to the same 4×4 pixel region from input to output.
Hierarchical embed dims [192 → 256 → 288 → 384]: representational capacity scales with feature complexity. Early stages learn local edges and textures (192-dim is sufficient). Deep stages reason about global scene semantics (384-dim is necessary).
Inter-stage macro concatenation: the outputs of all 4 stages are concatenated along the channel axis and passed directly to the reconstruction head: cat([h1, h2, h3, h4]) → [B, 256, 1120], i.e. 16×16 = 256 tokens with 192 + 256 + 288 + 384 = 1120 channels each. The head receives low-level edge maps (h1) and high-level semantic context (h4) simultaneously.
GDFN feed-forward: replaces standard MLPs with Gated Depthwise Feed-Forward Networks. Each token sees its 3×3 spatial neighborhood during the MLP step, so local spatial context is injected in every transformer layer.
Bilinear skip connection: output = F.interpolate(lr, 256×256) + vit_residual. The model learns only a residual correction, not a full reconstruction from scratch.
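
The decisions above can be sketched end to end in PyTorch. This is a minimal illustration of the tensor shapes, not the released implementation. Only the 16×16 grid, the stage widths, the concatenation, and the bilinear skip come from the description above. Everything else is an assumption made for the sketch: the class names (`GDFN`, `Stage`, `IsotropicSR`), one block per stage, 4 attention heads, the GDFN expansion factor, the 4×4 strided-conv patch embedding, and the linear + `PixelShuffle` head. The 64×64 input size is inferred from 16×16 tokens covering 4×4 pixels each.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GDFN(nn.Module):
    """Gated depthwise feed-forward: a 3x3 depthwise conv mixes neighbouring tokens."""

    def __init__(self, dim, grid=16, expansion=2):  # expansion is an assumption
        super().__init__()
        hidden = dim * expansion
        self.grid = grid
        self.proj_in = nn.Conv2d(dim, hidden * 2, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)
        self.proj_out = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):  # x: [B, N, C] with N = grid * grid
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        gate, value = self.dwconv(self.proj_in(x)).chunk(2, dim=1)
        x = self.proj_out(F.gelu(gate) * value)
        return x.flatten(2).transpose(1, 2)  # back to [B, N, C]


class Stage(nn.Module):
    """One pre-norm block per stage (assumed depth); the token count never changes."""

    def __init__(self, dim_in, dim_out, grid=16, heads=4):
        super().__init__()
        self.widen = nn.Linear(dim_in, dim_out)  # changes channels only
        self.norm1 = nn.LayerNorm(dim_out)
        self.attn = nn.MultiheadAttention(dim_out, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim_out)
        self.ffn = GDFN(dim_out, grid)

    def forward(self, x):
        x = self.widen(x)
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))


class IsotropicSR(nn.Module):
    def __init__(self, dims=(192, 256, 288, 384), patch=4, scale=4, grid=16):
        super().__init__()
        self.scale, self.grid = scale, grid
        self.embed = nn.Conv2d(3, dims[0], kernel_size=patch, stride=patch)
        dims_in = (dims[0],) + tuple(dims[:-1])
        self.stages = nn.ModuleList(
            [Stage(i, o, grid) for i, o in zip(dims_in, dims)])
        up = patch * scale  # each token reconstructs one 16x16 output tile
        self.head = nn.Linear(sum(dims), 3 * up * up)
        self.to_pixels = nn.PixelShuffle(up)

    def fuse(self, lr):
        x = self.embed(lr).flatten(2).transpose(1, 2)  # [B, 256, 192]
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # h1..h4, all on the same 16x16 grid
        return torch.cat(feats, dim=-1)  # [B, 256, 1120]

    def forward(self, lr):
        b = lr.shape[0]
        res = self.head(self.fuse(lr))  # [B, 256, 768]
        res = res.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        res = self.to_pixels(res)  # [B, 3, 256, 256]
        base = F.interpolate(lr, scale_factor=self.scale, mode="bilinear",
                             align_corners=False)
        return base + res  # bilinear skip + learned residual


model = IsotropicSR().eval()
lr = torch.rand(1, 3, 64, 64)
with torch.no_grad():
    print(model.fuse(lr).shape)  # torch.Size([1, 256, 1120])
    print(model(lr).shape)       # torch.Size([1, 3, 256, 256])
```

Because the stages never merge tokens, `torch.cat` needs no resampling to align h1 through h4. In this sketch, zeroing the head's weights and bias makes the output exactly the bilinear upsample, which is what "residual correction only" means in practice.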
Results
| Benchmark | Avg PSNR | Avg SSIM |
|---|---|---|
| DIV2K validation | 25.20 dB | 0.8298 |
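
For readers who want to sanity-check numbers of this kind, PSNR can be computed as below. This is the standard definition only. The evaluation protocol behind the table (RGB versus luminance channel, border cropping, value range) is not stated here, so the `psnr` helper and its `max_val=1.0` default are assumptions and may not reproduce the reported figures exactly.

```python
import math
import torch


def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((pred.double() - target.double()) ** 2).item()
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of about 0.01.
sr = torch.full((1, 3, 8, 8), 0.6)
hr = torch.full((1, 3, 8, 8), 0.5)
print(f"{psnr(sr, hr):.2f} dB")  # 20.00 dB
```

On this scale, 25.20 dB corresponds to a root-mean-square error of roughly 0.055 on [0, 1] data.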