The figure illustrates our action-constrained RL framework. When encoding CTU $i$, we first evaluate a state $s_i$. Taking this state as input, our RL agent outputs an action, namely the QP for the CTU. The x265 codec then encodes CTU $i$, after which we evaluate a distortion reward and a rate reward. These steps are repeated until all the CTUs in a frame are encoded. At training time, the agent interacts with x265 by treating the encoding of each frame as an episodic task. We utilize the distortion and rate critics to predict the distortion and rate reward-to-go, respectively. The rate critic, which predicts the rate deviation from the target bitrate at the end of encoding a frame, enables us to specify a feasible set $\mathcal{C}(s_i)$ of actions. The distortion critic, which estimates the cumulative distortion, guides the agent to minimize the total ROI-weighted distortion.
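As a hedged illustration of this interaction loop, the following Python sketch treats one frame as an episode: a per-CTU state, the agent's QP action, and distortion and rate rewards. The `codec` object and its `get_state`/`encode_ctu` methods are hypothetical wrappers around CTU-level x265 encoding, not the actual x265 API, and the reward definitions are simplified placeholders.

```python
# Sketch of one training episode (one frame), assuming a hypothetical
# `codec` wrapper around x265 and an `agent` exposing an act() method.
def encode_frame_episode(frame_ctus, agent, codec, roi_weights, target_bits):
    transitions = []
    bits_used = 0
    for i, ctu in enumerate(frame_ctus):
        s_i = codec.get_state(ctu, bits_used, target_bits)   # state for CTU i
        qp_i = agent.act(s_i)                                 # action: QP for CTU i
        dist_i, bits_i = codec.encode_ctu(ctu, qp_i)          # encode CTU i with x265
        bits_used += bits_i
        r_dist = -roi_weights[i] * dist_i                     # ROI-weighted distortion reward
        # One plausible rate reward: deviation from the target, credited at frame end.
        r_rate = -abs(bits_used - target_bits) if i == len(frame_ctus) - 1 else 0.0
        transitions.append((s_i, qp_i, r_dist, r_rate))
    return transitions   # used to train the distortion and rate critics
```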
In our proposed method, we first identify the feasible set $\mathcal{C}(s_i)$ using the rate critic. To satisfy the target bitrate, $\mathcal{C}(s_i)$ includes the QP values $QP_i$ for which the rate reward-to-go $Q_R$ is greater than or equal to a threshold $\epsilon$ (see figure (a)). We then utilize NFWPO to update the actor network in three consecutive steps. First, it identifies a feasible update direction $\bar{c}(s_i)$ based on the distortion reward-to-go $Q_D$ and the feasible set. Second, a reference action $\tilde{a}_{s_i}$ is evaluated by taking a small step along the update direction from the projected initial action $\prod\nolimits_{\mathcal{C}(s_i)}(\pi(s_i))$. Lastly, it updates the actor network by gradient descent, minimizing the squared error between the reference action and the actor's output.
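To make the three NFWPO steps concrete, the sketch below performs one actor update for a single CTU state with a scalar QP action. The toy network architectures, the QP range, the candidate enumeration of the feasible set, and the hyperparameters (`eps`, `step`) are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

def nfwpo_update(actor, dist_critic, rate_critic, actor_opt, s,
                 qp_min=10, qp_max=51, eps=0.0, step=0.05):
    """One NFWPO actor update for state s (1-D feature tensor); the scalar QP
    action and the enumerated candidate QPs are simplifying assumptions."""
    cand = torch.arange(qp_min, qp_max + 1, dtype=torch.float32).unsqueeze(1)
    s_rep = s.unsqueeze(0).expand(cand.size(0), -1)

    # Feasible set C(s) = {QP : rate reward-to-go Q_R(s, QP) >= eps}.
    with torch.no_grad():
        q_r = rate_critic(torch.cat([s_rep, cand], dim=1)).squeeze(1)
    feasible = cand[q_r >= eps]
    if feasible.numel() == 0:                       # fallback: least infeasible QP
        feasible = cand[q_r.argmax()].view(1, 1)

    # Project the actor's raw action pi(s) onto C(s) (nearest feasible QP).
    a = actor(s.unsqueeze(0))
    proj = feasible[(feasible - a.detach()).abs().argmin()].view(1, 1)

    # Step 1: feasible update direction c_bar via the Frank-Wolfe linear problem,
    # i.e., maximize <grad_a Q_D(s, a), c> over c in C(s).
    a_var = proj.clone().requires_grad_(True)
    q_d = dist_critic(torch.cat([s.unsqueeze(0), a_var], dim=1))
    grad = torch.autograd.grad(q_d.sum(), a_var)[0]
    c_bar = feasible[(grad * feasible).sum(dim=1).argmax()].view(1, 1)

    # Step 2: reference action = small step from the projected action toward c_bar.
    a_ref = proj + step * (c_bar - proj)

    # Step 3: gradient descent on the squared error between pi(s) and the reference action.
    loss = (actor(s.unsqueeze(0)) - a_ref.detach()).pow(2).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()

# Illustrative usage with toy networks (the state dimension 8 is arbitrary).
actor = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
dist_critic = nn.Sequential(nn.Linear(9, 32), nn.ReLU(), nn.Linear(32, 1))
rate_critic = nn.Sequential(nn.Linear(9, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
nfwpo_update(actor, dist_critic, rate_critic, opt, torch.randn(8))
```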
Reconstruction quality and QP assignment comparisons on images selected from the DAVIS and COCO datasets. The regions highlighted by red outlines are the regions of interest. Our method preserves more texture details in the ROI and shows fewer blocking artifacts by assigning lower QPs to ROI CTUs.