This work addresses continuous space-time video super-resolution (C-STVSR), which aims to upscale an input video both spatially and temporally by arbitrary scaling factors. One key challenge of C-STVSR is to propagate information temporally among the input video frames. To this end, we introduce a space-time local implicit neural function. It has the striking feature of learning forward motion for a continuum of pixels. We motivate the use of forward motion from the perspective of learning individual motion trajectories, as opposed to learning a mixture of motion trajectories with backward motion. To ease motion interpolation, we encode sparsely sampled forward motion extracted from the input video as contextual input. Together with a reliability-aware splatting and decoding scheme, our framework, termed MoTIF, achieves state-of-the-art performance on C-STVSR.
Our proposed MoTIF comprises four major components and operates as follows. First, given $I_{0}^L$ and $I_{1}^L$, (1) the encoder $E_I$ converts them into their latent representations $F_{0}^L,F_{1}^L,F_{(0,1)}^L \in \mathbb{R}^{C \times H \times W}$, where $F_{(0,1)}^L$ serves as a rough estimate of the feature of the target frame $I_t^H$. We adopt an off-the-shelf video-based encoder, which fuses information from both $I_{0}^L$ and $I_{1}^L$ in generating $F_{0}^L$, $F_{1}^L$, and $F_{(0,1)}^L$. Second, (2) the spatial local implicit neural function (S-INF) is queried to super-resolve $F_{0}^L$, $F_{1}^L$, and likewise $F_{(0,1)}^L$, into $F_{0}^H,F_{1}^H,F_{(0,1)}^H \in \mathbb{R}^{C \times H' \times W'}$, respectively. Third, considering $I_{0}^L$ as sitting at the origin in time, (3) the motion encoder $E_M$ encodes the forward optical flow map $M_{0 \rightarrow 1}^L \in \mathbb{R}^{2 \times H \times W}$, which captures the forward motion from $I_{0}^L$ to $I_{1}^L$, together with its reliability map $Z_{0 \rightarrow 1}^L \in \mathbb{R}^{3 \times H \times W}$ into $T_0^{L} \in \mathbb{R}^{C \times H \times W}$. Because optical flow estimation is not always perfect, $Z_{0 \rightarrow 1}^L$ indicates how reliable $M_{0 \rightarrow 1}^L$ is at each spatial location $(x,y)$ (Section~\ref{subsec:estimate_explicit_motion}). Fourth, using $T_0^{L}$ as the motion latent, (4) our space-time local implicit neural function (ST-INF) renders a high-resolution forward motion map $\hat{M}_{0 \rightarrow t}^H \in \mathbb{R}^{2 \times H' \times W'}$ and its reliability map $\hat{Z}_{0 \rightarrow t}^H \in \mathbb{R}^{H' \times W'}$ according to the query space-time coordinates $(x,y,t)$. $\hat{M}_{0 \rightarrow t}^H$ specifies the forward motion of the features in $F_{0}^H$ and is utilized to forward-warp $F_{0}^H$ to $F_t^H$. The same motion encoding, rendering, and warping processes are repeated for $I_{1}^L$ to aggregate information temporally from all the reference frames. Lastly, we perform softmax splatting to create $F_t^H$ and $Z_t^H$, which are further combined with $F_{(0,1)}^H$ to decode the high-resolution video frame $\hat{I}_t^H$ at time $t$. $Z_t^H$ indicates how reliable $F_t^H$ is at each spatial location and is used to condition the pixel-based decoding of the RGB values from $F_t^H$ and $F_{(0,1)}^H$.
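To make the rendering step (4) concrete, the following is a minimal PyTorch sketch of how an ST-INF-style query could be implemented. It assumes a LIIF-style nearest-latent lookup followed by a small MLP head; the class name \texttt{STINF}, the layer sizes, and the coordinate conventions are illustrative assumptions rather than the exact architecture.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class STINF(nn.Module):
    """Sketch of a space-time local implicit neural function.

    Given the motion latent T_0^L of shape (B, C, H, W) and continuous
    queries (x, y, t), it predicts a 2-channel forward flow and a
    1-channel reliability logit per query. Layer sizes are illustrative.
    """

    def __init__(self, latent_dim=64, hidden=256):
        super().__init__()
        # Input: local latent code + 2-D relative offset + query time t.
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3),  # (flow_x, flow_y, reliability)
        )

    def forward(self, latent, coords, t):
        # latent: (B, C, H, W); coords: (B, N, 2) in [-1, 1], (x, y)
        # order; t: scalar in [0, 1].
        b, n, _ = coords.shape
        _, _, h, w = latent.shape
        # Nearest local latent code for each query (LIIF-style lookup).
        sampled = F.grid_sample(latent, coords.view(b, n, 1, 2),
                                mode="nearest", align_corners=False)
        sampled = sampled.view(b, -1, n).permute(0, 2, 1)  # (B, N, C)
        # Offset from each query to the center of its latent code's cell.
        cx = torch.round((coords[..., 0] + 1) * w / 2 - 0.5).clamp(0, w - 1)
        cy = torch.round((coords[..., 1] + 1) * h / 2 - 0.5).clamp(0, h - 1)
        rel_x = coords[..., 0] - ((cx + 0.5) * 2 / w - 1)
        rel_y = coords[..., 1] - ((cy + 0.5) * 2 / h - 1)
        rel = torch.stack([rel_x, rel_y], dim=-1)
        tt = torch.full((b, n, 1), float(t), device=latent.device)
        out = self.mlp(torch.cat([sampled, rel, tt], dim=-1))
        return out[..., :2], out[..., 2]  # flow (B, N, 2), logit (B, N)
\end{verbatim}

Querying all $H' \times W'$ pixel centers at a given $t$ yields $\hat{M}_{0 \rightarrow t}^H$ and $\hat{Z}_{0 \rightarrow t}^H$ in a single pass; because the MLP is conditioned on a continuous spatial offset and a continuous time, the same network serves any spatial and temporal scaling factor.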
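The reliability-aware splatting in the last step also admits a compact sketch. Following Niklaus and Liu's softmax splatting formulation, the function below forward-warps features by the rendered flow, weighting each source pixel's contribution by its exponentiated reliability and normalizing by the accumulated weights. It is a plain-PyTorch approximation for illustration (the name \texttt{softmax\_splat} and the tensor layout are our assumptions); practical implementations usually rely on dedicated CUDA kernels.

\begin{verbatim}
import torch

def softmax_splat(feat, flow, z):
    """Forward-warp feat by flow, weighting source pixels by exp(z).

    feat: (B, C, H, W) source features, e.g. F_0^H
    flow: (B, 2, H, W) forward flow in pixels, channel order (dx, dy)
    z:    (B, 1, H, W) reliability logits
    Returns warped features (B, C, H, W) and a weight map (B, 1, H, W).
    """
    b, c, h, w = feat.shape
    gy, gx = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij")
    tx = gx.unsqueeze(0) + flow[:, 0]  # target x of each source pixel
    ty = gy.unsqueeze(0) + flow[:, 1]  # target y of each source pixel
    x0, y0 = tx.floor(), ty.floor()
    # Subtract the per-sample max before exponentiating for numerical
    # stability; the normalization below cancels the constant.
    wgt = (z - z.flatten(2).max(-1).values[..., None, None]).exp()
    payload = torch.cat([feat * wgt, wgt], dim=1)  # splat both jointly
    acc = payload.new_zeros(b, c + 1, h * w)
    for dx, dy in ((0, 0), (1, 0), (0, 1), (1, 1)):  # bilinear corners
        cx, cy = x0 + dx, y0 + dy
        bw = (1 - (tx - cx).abs()) * (1 - (ty - cy).abs())
        inside = ((cx >= 0) & (cx < w) & (cy >= 0) & (cy < h)).to(bw.dtype)
        idx = (cy.clamp(0, h - 1) * w + cx.clamp(0, w - 1)).long()
        contrib = (payload * (bw * inside).unsqueeze(1)).view(b, c + 1, -1)
        acc.scatter_add_(2, idx.view(b, 1, -1).expand(-1, c + 1, -1),
                         contrib)
    acc = acc.view(b, c + 1, h, w)
    return acc[:, :c] / acc[:, c:].clamp_min(1e-6), acc[:, c:]
\end{verbatim}

Splatting $F_{0}^H$ with $\hat{M}_{0 \rightarrow t}^H$ as the flow and $\hat{Z}_{0 \rightarrow t}^H$ as the logits (and likewise for the $I_{1}^L$ side) down-weights contributions carried by unreliable motion, and the returned weight map plays the role of $Z_t^H$ when conditioning the decoder.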