In this work, we propose a transformer-based image codec capable of achieving variable-rate compression while supporting ROI functionality. Inspired by prompting techniques, we introduce prompt generation networks to condition our transformer-based codec. The figure above illustrates our overall architecture, which is built upon TIC (Transformer-based Image Compression). To encode an input image $x$, the encoder takes two additional inputs: a lambda map $\mathbf{M}_\lambda$ and an ROI mask $\mathbf{M}_R$. The lambda map $\mathbf{M}_\lambda$ is a uniform map populated with the same rate parameter $m_\lambda \in [0, 1]$, which controls the bit rate of the compressed bitstream. The ROI mask $\mathbf{M}_R$ specifies spatially the importance of individual pixels in the image; each of its elements is a real value in $[0, 1]$. Both inputs serve as conditioning signals used to generate prompt tokens that adapt the main encoder. In a similar way, the decoder is adapted by taking as inputs the quantized image latent $\hat{y}$ and a downscaled lambda map $\mathbf{M}'_\lambda$ that matches the spatial resolution of $\hat{y}$. A striking feature of our design is that it disentangles rate control (i.e., $\mathbf{M}_\lambda$) from spatial quality control (i.e., $\mathbf{M}_R$). In other words, it treats them as two independent dimensions, offering a more intuitive way to achieve simultaneous variable-rate and ROI coding.
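To make the conditioning concrete, here is a minimal PyTorch sketch of how these inputs could be assembled. All names, shapes, and the 16x latent downscaling factor are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: an image x of shape (B, 3, H, W) and a latent
# downscaled by a factor of 16, as in typical learned codecs (assumption).
B, H, W = 1, 256, 256
x = torch.rand(B, 3, H, W)

m_lambda = 0.5                                   # rate parameter in [0, 1]
M_lambda = torch.full((B, 1, H, W), m_lambda)    # uniform lambda map
M_R = torch.zeros(B, 1, H, W)                    # ROI mask, values in [0, 1]
M_R[:, :, 64:192, 64:192] = 1.0                  # mark a central region as ROI

# Encoder-side conditioning: the prompt generation network consumes the
# concatenation of the image, ROI mask, and lambda map.
enc_cond = torch.cat([x, M_R, M_lambda], dim=1)  # (B, 5, H, W)

# Decoder-side conditioning: a lambda map downscaled to the latent resolution.
M_lambda_ds = F.interpolate(M_lambda, size=(H // 16, W // 16), mode="nearest")
print(enc_cond.shape, M_lambda_ds.shape)         # (1, 5, 256, 256) (1, 1, 16, 16)
```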
We propose to use learned parameters, known as prompts, as additional inputs to the Swin-transformer layers in order to achieve variable-rate and ROI coding. The resulting Swin-transformer block (STB) is termed a prompted STB (P-STB). The learned prompts are produced by two generation networks, $p_a$ and $p_s$, which condition the encoder and decoder, respectively. $p_a$ consists of several convolutional layers that match those of the encoder $g_a$, and it takes as input the concatenation of the ROI mask, lambda map, and image. The feature maps of $p_a$ are fed into the corresponding P-STBs to generate prompt tokens that interact with the image tokens. $p_s$ follows a similar architecture, replacing the convolutional layers with transposed convolutional layers for upsampling. The figure above further details the P-STB, where $P_i$ and $I_i$ denote the prompt and image tokens, respectively. They are fed into the $i^{th}$ Swin-transformer layer $S_i$ for window-based attention to arrive at $I_{i+1}$. Specifically, each window has its own image and prompt tokens: we partition the prompt tokens spatially in the same way as the image tokens. Our design differs from regular Visual Prompt Tuning, where all the non-overlapping windows in a Swin-transformer layer share the same learned prompts, which we argue is suboptimal for spatially adaptive quality control such as ROI coding.
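To illustrate the per-window prompting idea, the sketch below partitions the prompt features with the same window grid as the image features, so each window's image tokens attend only to the prompts falling into that window. The module, the token interaction (prompts appended as extra keys/values), and all hyperparameters are our own assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

def window_partition(t, ws):
    """Split (B, H, W, C) features into (num_windows*B, ws*ws, C) tokens."""
    B, H, W, C = t.shape
    t = t.view(B, H // ws, ws, W // ws, ws, C)
    return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

class PromptedWindowAttention(nn.Module):
    """Hypothetical P-STB attention step: image tokens in each window attend
    over themselves plus the prompt tokens of the same window."""
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feat, prompt_feat):
        # Partition image and prompt features with the same window grid, so
        # the prompts stay spatially aligned with the windows they condition.
        I = window_partition(img_feat, self.ws)     # (nW*B, ws*ws, C)
        P = window_partition(prompt_feat, self.ws)  # (nW*B, ws*ws, C)
        kv = torch.cat([I, P], dim=1)               # keys/values: image + prompts
        out, _ = self.attn(I, kv, kv)               # queries: image tokens only
        return out

blk = PromptedWindowAttention(dim=96, num_heads=4, window_size=8)
img = torch.rand(1, 32, 32, 96)  # (B, H, W, C) image features
prm = torch.rand(1, 32, 32, 96)  # prompt features from p_a at the same scale
print(blk(img, prm).shape)       # (16, 64, 96): 16 windows of 64 tokens each
```

Because the prompts are partitioned per window rather than shared across all windows, different spatial regions can receive different conditioning, which is what enables ROI-aware quality control.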
Rate-distortion performance comparison under two settings: with and without ROI control. The methods are evaluated on Kodak and COCO 2017 val in terms of BPP (bits per pixel) versus (a) full-image PSNR and (b) ROI-only PSNR. Click on the image to enlarge it.
Selected visualizations of decoded images produced by our method, corresponding to experiment (b), variable rate with ROI. Compared with the other methods, our prompt-based method more effectively suppresses the background and achieves higher quality on the foreground objects. Click on the image to enlarge it.
In this work, we disentangle rate control ($\mathbf{M}_\lambda$) from spatial quality control ($\mathbf{M}_R$). The image below shows the effect of these two factors, with $\mathbf{M}_R$ varied along the x-axis and $\mathbf{M}_\lambda$ along the y-axis. Click on the image to enlarge it.