Abstract

This paper presents a reinforcement learning (RL) framework that uses Frank-Wolfe policy optimization to solve Coding-Tree-Unit (CTU) bit allocation for Region-of-Interest (ROI) intra-frame coding. Most previous RL-based methods adopt a single-critic design, in which the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyper-parameter. More recently, a dual-critic design was proposed that updates the actor by alternating between the rate and distortion critics; however, its convergence is not guaranteed. To address these issues, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO), which formulates CTU-level bit allocation as an action-constrained RL problem. In this framework, a rate critic predicts a feasible set of actions. Given this feasible set, a distortion critic is invoked to update the actor so that it maximizes the ROI-weighted image quality subject to a rate constraint. Experimental results produced with x265 confirm the superiority of the proposed method over the other baselines.

Method


In our proposed method, we first identify the feasible set $\mathcal{C}(s_i)$ using the rate critic. To satisfy the target bitrate, $\mathcal{C}(s_i)$ contains the QP values $QP_i$ for which the rate reward-to-go $Q_R$ is greater than or equal to a threshold $\epsilon$ (see figure (a)). We then use NFWPO to update the actor network in three consecutive steps. First, it identifies a feasible update direction $\bar{c}(s)$ from the distortion reward-to-go $Q_D$ and the feasible set, namely the action in $\mathcal{C}(s)$ that maximizes the linearization of $Q_D$ around the current action. Second, it evaluates a reference action $\tilde{a}_{s_i}$ by taking a small step along this update direction, starting from the projected initial action $\Pi_{\mathcal{C}(s)}(\pi(s))$. Lastly, it trains the actor network by gradient descent, minimizing the squared error between the actor's output $\pi(s)$ and the reference action.
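The three NFWPO steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `rate_critic` and `distortion_critic` are hypothetical hand-written stand-ins for the learned critic networks, the CTU state is a single scalar feature, the actor is linear, and $Q_R$ is assumed to increase with QP so that $\mathcal{C}(s)$ is an interval. A central finite difference stands in for autodiff of $Q_D$ with respect to the action.

```python
"""Toy sketch of the NFWPO actor update for CTU-level QP selection.

Assumptions (hypothetical, not from the paper): scalar CTU state feature,
linear actor, hand-written stand-ins for the rate and distortion critics.
"""

QP_GRID = range(52)  # candidate QPs 0..51 (the 8-bit HEVC/x265 range)


def rate_critic(state: float, qp: float) -> float:
    """Hypothetical Q_R(s, a): rises with QP because higher QP spends fewer bits."""
    return qp - (26.0 + 4.0 * state)


def distortion_critic(state: float, qp: float) -> float:
    """Hypothetical Q_D(s, a): lower QP gives less (ROI-weighted) distortion."""
    return -qp


def feasible_interval(state: float, eps: float = 0.0) -> tuple[float, float]:
    """C(s) = {QP : Q_R(s, QP) >= eps}, taken as an interval (assumes Q_R rises with QP)."""
    feasible = [qp for qp in QP_GRID if rate_critic(state, qp) >= eps]
    if not feasible:
        raise ValueError("rate critic predicts no feasible QP for this state")
    return float(min(feasible)), float(max(feasible))


def grad_distortion(state: float, qp: float, h: float = 1e-3) -> float:
    """Central-difference stand-in for dQ_D/da at the given action."""
    return (distortion_critic(state, qp + h) - distortion_critic(state, qp - h)) / (2 * h)


def reference_action(state: float, action: float, alpha: float = 0.1) -> float:
    """Frank-Wolfe reference action a_tilde for one state."""
    lo, hi = feasible_interval(state)
    a_proj = min(max(action, lo), hi)         # projection of pi(s) onto C(s)
    g = grad_distortion(state, a_proj)        # gradient of Q_D at the projected action
    c_bar = hi if g > 0 else lo               # argmax over C(s) of the linearized Q_D
    return a_proj + alpha * (c_bar - a_proj)  # small step toward c_bar stays in C(s)


def actor(params: tuple[float, float], state: float) -> float:
    """Hypothetical linear actor pi(s) = w * s + b."""
    w, b = params
    return w * state + b


def actor_step(params, states, alpha=0.1, lr=0.5):
    """One gradient-descent step on mean (pi(s) - a_tilde(s))^2, a_tilde held fixed."""
    w, b = params
    gw = gb = 0.0
    for s in states:
        a = actor(params, s)
        err = a - reference_action(s, a, alpha)
        gw += 2.0 * err * s
        gb += 2.0 * err
    n = len(states)
    return w - lr * gw / n, b - lr * gb / n


if __name__ == "__main__":
    params = (0.0, 40.0)
    states = [0.0, 0.5, 1.0]
    for _ in range(2000):
        params = actor_step(params, states)
    print([round(actor(params, s), 2) for s in states])  # [26.0, 28.0, 30.0]
```

With these toy critics the lowest feasible QP is also the best one for distortion, so the actor settles on the boundary $26 + 4s$ of $\mathcal{C}(s)$. Because $\tilde{a}$ is a convex combination of two points in $\mathcal{C}(s)$, it stays feasible without a second projection, and the actor is fitted to it by plain regression.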

Paper

Results

Comparison of reconstruction quality and QP assignment on images selected from the DAVIS and COCO datasets. Regions outlined in red are the regions of interest. Our method preserves more texture detail in the ROI and shows fewer blocking artifacts by assigning lower QPs to ROI CTUs.

Reconstructed Images

QP Assignment Heatmap