Abstract

This paper proposes a transformer-based learned image compression system. It achieves variable-rate compression with a single model while supporting region-of-interest (ROI) functionality. Inspired by prompt tuning, we introduce prompt generation networks to condition the transformer-based compression autoencoder. Our prompt generation networks generate content-adaptive tokens according to the input image, an ROI mask, and a rate parameter. Keeping the ROI mask and the rate parameter separate gives an intuitive way to achieve variable-rate and ROI coding simultaneously. Extensive experiments validate the effectiveness of our proposed method and confirm its superiority over competing methods.

Method


We propose to use learned parameters, known as prompts, as additional inputs to the Swin-transformer layers in order to achieve variable-rate and ROI coding. The resulting Swin-transformer block (STB) is termed the prompted Swin-transformer block (P-STB). The learned prompts are produced by two generation networks, $p_a$ and $p_s$, which condition the encoder and the decoder, respectively. $p_a$ consists of several convolutional layers that match those of the encoder $g_a$, and it takes as input the concatenation of the ROI mask, the lambda map, and the image. The feature maps of $p_a$ are fed into the corresponding P-STBs to generate prompt tokens that interact with the image tokens. $p_s$ follows a similar architecture, replacing the convolutional layers with transposed convolutional layers for upsampling. The figure above further details the P-STB, where $P_i$ and $I_i$ denote the prompt and image tokens, respectively. They are fed into the $i$-th Swin-transformer layer $S_i$ for window-based attention to arrive at $I_{i+1}$. Specifically, each window has its own image and prompt tokens: we partition the prompt tokens spatially in the same way as the image tokens. Our design differs from regular Visual Prompt Tuning, where all the non-overlapping windows in a Swin-transformer layer share the same learned prompts. We argue that sharing prompts across windows is not optimal for spatially adaptive quality control such as ROI coding.
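The per-window pairing of prompt and image tokens can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the prompt tokens form a map with the same spatial size as the image tokens, that queries come from the image tokens only while keys and values come from the concatenation of image and prompt tokens, and it omits the learned projections, multiple heads, and shifted windows of a real Swin-transformer layer. The function names are hypothetical.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) token map into non-overlapping w x w windows.

    Returns an array of shape (num_windows, w*w, C).
    """
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def window_reverse(windows, w, H, W):
    """Inverse of window_partition: back to an (H, W, C) token map."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def prompted_window_attention(image_tokens, prompt_tokens, w):
    """Window attention in which every window sees only its own prompts.

    Assumption (not from the source): queries are the image tokens, keys
    and values are [image tokens; prompt tokens] of the same window, and
    the learned Q/K/V projections are replaced by the identity.
    """
    H, W, C = image_tokens.shape
    I = window_partition(image_tokens, w)    # (nW, w*w, C)
    P = window_partition(prompt_tokens, w)   # (nW, w*w, C), same split
    kv = np.concatenate([I, P], axis=1)      # (nW, 2*w*w, C)
    attn = softmax(I @ kv.transpose(0, 2, 1) / np.sqrt(C))
    out = attn @ kv                          # (nW, w*w, C)
    return window_reverse(out, w, H, W)
```

Because the prompts are partitioned like the image tokens, changing the prompts inside one window affects only the output tokens of that window, which is the property that spatially adaptive control such as ROI coding relies on.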

Rate-distortion Results

Rate-distortion performance is compared under two settings, with and without ROI control. The methods are evaluated on Kodak and COCO 2017val in terms of bits per pixel (BPP) and (a) full-image PSNR or (b) ROI-only PSNR.

(a) Variable-rate Compression w/o ROI

(b) Variable-rate Compression with ROI
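For readers unfamiliar with the two axes, the following sketch shows one common way to compute them. It is only an illustration under stated assumptions: ROI-only PSNR is taken here as the PSNR of the mean squared error restricted to pixels inside the ROI mask, which may differ in detail from the evaluation code behind the plots, and the function names are hypothetical.

```python
import math
import numpy as np

def psnr(reference, decoded, mask=None, peak=255.0):
    """PSNR in dB between two images of identical shape.

    If mask (an (H, W) boolean array) is given, only pixels where the
    mask is True contribute to the error (assumed ROI-only PSNR).
    """
    err = (reference.astype(np.float64) - decoded.astype(np.float64)) ** 2
    if mask is not None:
        mask = np.asarray(mask, dtype=bool)
        if not mask.any():
            raise ValueError("ROI mask selects no pixels")
        err = err[mask]  # keeps all channels of the selected pixels
    mse = err.mean()
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def bits_per_pixel(num_bytes, height, width):
    """Bitstream size in bits divided by the number of image pixels."""
    return 8.0 * num_bytes / (height * width)
```

With this definition, errors confined to the background lower the full-image PSNR but leave the ROI-only PSNR unchanged.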

Visualization

Selected decoded images from our proposed method for experiment (b), variable-rate compression with ROI. Compared with the other methods, our prompt-based method removes the background more effectively and gives the foreground objects higher quality.


Disentanglement of Rate and Spatial Quality Control

In this work, we disentangle rate control ($\mathbf{M}_\lambda$) from spatial quality control ($\mathbf{M}_R$). The image below shows the effect of each, with $\mathbf{M}_R$ varied along the x axis and $\mathbf{M}_\lambda$ varied along the y axis.
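As a rough sketch of how the two controls stay separate at the input of the prompt generation network, the snippet below builds the concatenation of ROI mask, lambda map, and image described in the Method section. Several details are assumptions rather than facts from the source: the lambda map is taken to be spatially uniform, the rate parameter is assumed to lie in $[0, 1]$, the channel order is arbitrary, and `build_conditioning_input` is a hypothetical name.

```python
import numpy as np

def build_conditioning_input(image, roi_mask, rate):
    """Stack M_R, M_lambda and the image along the channel axis.

    image:    (H, W, 3) float array
    roi_mask: (H, W) array with values in [0, 1] (1 = region of interest)
    rate:     scalar rate parameter, assumed here to lie in [0, 1]
    Returns an (H, W, 5) array: [M_R, M_lambda, R, G, B].
    """
    H, W, _ = image.shape
    m_r = np.asarray(roi_mask, dtype=image.dtype)[..., None]   # (H, W, 1)
    m_lambda = np.full((H, W, 1), rate, dtype=image.dtype)     # uniform map
    return np.concatenate([m_r, m_lambda, image], axis=-1)
```

Changing `rate` alters only the $\mathbf{M}_\lambda$ channel and changing `roi_mask` alters only the $\mathbf{M}_R$ channel, so the two can be swept independently, as in the grid of decoded images.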