The figure illustrates our action-constrained RL framework. When encoding CTU $i$, we first evaluate a state $s_i$. Taking this state as input, our RL agent outputs an action, namely the QP for the CTU. The x265 codec then encodes CTU $i$, after which we evaluate a distortion reward and a rate reward. These steps are repeated until all the CTUs in a frame are encoded. At training time, the agent interacts with x265 by treating the encoding of each frame as an episodic task. We utilize the distortion and rate critics to predict the distortion and rate reward-to-go, respectively. The rate critic, which predicts the rate deviation from the target bitrate at the end of encoding a frame, enables us to specify a feasible set $\mathcal{C}(s_i)$ of actions. The distortion critic, which estimates the cumulative distortion, guides the agent to minimize the total ROI-weighted distortion.
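As a hedged illustration of this interaction loop, the following Python sketch treats one frame as an episode: a per-CTU state, the agent's QP action, and distortion and rate rewards. The `codec` object and its `get_state`/`encode_ctu` methods are hypothetical wrappers around CTU-level x265 encoding, not the actual x265 API, and the reward definitions are simplified placeholders.

```python
# Sketch of one training episode (one frame), assuming a hypothetical
# `codec` wrapper around x265 and an `agent` exposing an act() method.
def encode_frame_episode(frame_ctus, agent, codec, roi_weights, target_bits):
    transitions = []
    bits_used = 0
    for i, ctu in enumerate(frame_ctus):
        s_i = codec.get_state(ctu, bits_used, target_bits)   # state for CTU i
        qp_i = agent.act(s_i)                                 # action: QP for CTU i
        dist_i, bits_i = codec.encode_ctu(ctu, qp_i)          # encode CTU i with x265
        bits_used += bits_i
        r_dist = -roi_weights[i] * dist_i                     # ROI-weighted distortion reward
        # One plausible rate reward: deviation from the target, credited at frame end.
        r_rate = -abs(bits_used - target_bits) if i == len(frame_ctus) - 1 else 0.0
        transitions.append((s_i, qp_i, r_dist, r_rate))
    return transitions   # used to train the distortion and rate critics
```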
In our proposed method, we first identify the feasible set $\mathcal{C}(s_i)$ using the rate critic. To satisfy the target bitrate, $\mathcal{C}(s_i)$ includes the QP values $QP_i$ for which the rate reward-to-go $Q_R$ is greater than or equal to a threshold $\epsilon$ (see figure (a)). We then utilize NFWPO to update the actor network in three consecutive steps. First, it identifies a feasible update direction $\bar{c}(s_i)$ based on the distortion reward-to-go $Q_D$ and the feasible set. Second, a reference action $\tilde{a}_{s_i}$ is evaluated by taking a small step along the update direction from the projected initial action $\prod\nolimits_{\mathcal{C}(s_i)}(\pi(s_i))$. Lastly, it updates the actor network by gradient descent, minimizing the squared error between the reference action and the actor's output.
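To make the three NFWPO steps concrete, the sketch below performs one actor update for a single CTU state with a scalar QP action. The toy network architectures, the QP range, the candidate enumeration of the feasible set, and the hyperparameters (`eps`, `step`) are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

def nfwpo_update(actor, dist_critic, rate_critic, actor_opt, s,
                 qp_min=10, qp_max=51, eps=0.0, step=0.05):
    """One NFWPO actor update for state s (1-D feature tensor); the scalar QP
    action and the enumerated candidate QPs are simplifying assumptions."""
    cand = torch.arange(qp_min, qp_max + 1, dtype=torch.float32).unsqueeze(1)
    s_rep = s.unsqueeze(0).expand(cand.size(0), -1)

    # Feasible set C(s) = {QP : rate reward-to-go Q_R(s, QP) >= eps}.
    with torch.no_grad():
        q_r = rate_critic(torch.cat([s_rep, cand], dim=1)).squeeze(1)
    feasible = cand[q_r >= eps]
    if feasible.numel() == 0:                       # fallback: least infeasible QP
        feasible = cand[q_r.argmax()].view(1, 1)

    # Project the actor's raw action pi(s) onto C(s) (nearest feasible QP).
    a = actor(s.unsqueeze(0))
    proj = feasible[(feasible - a.detach()).abs().argmin()].view(1, 1)

    # Step 1: feasible update direction c_bar via the Frank-Wolfe linear problem,
    # i.e., maximize <grad_a Q_D(s, a), c> over c in C(s).
    a_var = proj.clone().requires_grad_(True)
    q_d = dist_critic(torch.cat([s.unsqueeze(0), a_var], dim=1))
    grad = torch.autograd.grad(q_d.sum(), a_var)[0]
    c_bar = feasible[(grad * feasible).sum(dim=1).argmax()].view(1, 1)

    # Step 2: reference action = small step from the projected action toward c_bar.
    a_ref = proj + step * (c_bar - proj)

    # Step 3: gradient descent on the squared error between pi(s) and the reference action.
    loss = (actor(s.unsqueeze(0)) - a_ref.detach()).pow(2).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()

# Illustrative usage with toy networks (the state dimension 8 is arbitrary).
actor = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
dist_critic = nn.Sequential(nn.Linear(9, 32), nn.ReLU(), nn.Linear(32, 1))
rate_critic = nn.Sequential(nn.Linear(9, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
nfwpo_update(actor, dist_critic, rate_critic, opt, torch.randn(8))
```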
Reconstruction quality and QP assignment comparisons on images selected from the DAVIS and COCO datasets. The regions highlighted by red outlines are the regions of interest. Our method preserves more texture details in the ROI and shows fewer blocking artifacts by assigning lower QPs to ROI CTUs.