# Background

• As explored in Equivariant Neural Rendering (Dupont et al., ICML 2020), an equivariant renderer $$g$$ is one where \begin{align}z &= f(x)\\g(z) &= x\\g(T^z(z)) &= g(T^z(f(x))) = T^x(g(z)) = T^x(x)\end{align}
• It learns by training $$g, f$$ to minimize $$\|g(T^z(f(x))) - T^x(x)\|$$, where $$T^x$$ is a known rotation of the input and $$T^z$$ is the same rotation but in $$z$$ space. Afterwards, $$g$$ can render unknown viewpoints just by transforming $$f(x)$$ (a minimal sketch follows this list).
• Note that a video has several parts that can be placed into two sets: $$C$$, the set of moving characters, and $$B$$, the static background (everything else).
• Further, at an appropriate level of granularity, the change in a character's position between any sufficiently close frames $$x_t$$ and $$x_{t+1}$$ can be modeled as an affine transformation $$T$$. This is a model of the movement and is of course an approximation, but it is a reasonable one that should hold better as we increase the frames per second.
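
As a concrete illustration, here is a minimal PyTorch sketch of one equivariant-rendering training step, assuming a 2D rotation for $$T^x$$ and toy convolutional stand-ins for $$f$$ and $$g$$ (all names and shapes here are illustrative, not taken from the paper):

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def rotation_theta(angle, batch):
    """2x3 affine matrix for a rotation by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    theta = torch.tensor([[c, -s, 0.0], [s, c, 0.0]])
    return theta.unsqueeze(0).repeat(batch, 1, 1)

def apply_affine(tensor, theta):
    """Warp a [B, C, H, W] tensor by a batch of 2x3 affine matrices."""
    grid = F.affine_grid(theta, tensor.shape, align_corners=False)
    return F.grid_sample(tensor, grid, align_corners=False)

# Toy convolutional stand-ins for the encoder f and renderer g.
f = nn.Conv2d(3, 16, 3, padding=1)
g = nn.Conv2d(16, 3, 3, padding=1)

x = torch.rand(8, 3, 64, 64)              # batch of input images
theta = rotation_theta(0.3, x.size(0))    # the known transformation T^x

z_t = apply_affine(f(x), theta)           # T^z(f(x)): same rotation in z space
x_t = apply_affine(x, theta)              # T^x(x): the target

loss = (g(z_t) - x_t).abs().mean()        # || g(T^z(f(x))) - T^x(x) ||
loss.backward()
```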

# Hypothesis

• From this model, we hypothesize that we can disentangle a scene into characters and background by training a renderer to be equivariant to affine transformations of a character and invariant to transformations of the background.
• In other words, assuming a single moving character, we train $$g$$, $$f$$, and $$h$$ s.t. \begin{align}g(T^z(f(x)) + h(x)) = T^x(x) = x'\end{align} where $$x'$$ is the transformed (next) frame.
• To do this, we utilize one main scene loss $$L_s$$ along with two supplementary losses $$L_c$$ and $$L_b$$ representing the character and background respectively (see the code sketch at the end of this section):\begin{align}L_s &= \|g(T^z(f(x)) + h(x)) - x'\|\\L_c &= \|T^z(f(x)) - f(x')\|\\L_b &= \|h(x') - h(x)\|\end{align}
• The domains and ranges are as follows:
• $$f$$: image of shape $$[H, W, 3] \rightarrow$$ feature of shape $$[H', W', C']$$
• $$h$$: image of shape $$[H, W, 3] \rightarrow$$ feature of shape $$[H', W', C']$$
• $$T^z$$: <see variations below> $$\rightarrow \mathbb{R}^6$$
• $$g$$: feature of shape $$[H', W', C'] \rightarrow$$ image of shape $$[H, W, 3]$$
• Because $$T^z(f(x))$$ varies from frame to frame, $$f$$ should account for the varying parts of the sequence (the character); because $$L_b$$ pushes $$h$$ to be constant across frames, $$h$$ should account for the non-varying parts, the background. This has hints of slow feature analysis.
• Afterward, we can render the character under any affine transformation on any of the trained backgrounds by mixing a background $$h(x^k)$$ with affine transformations of the character features $$f(x^j)$$. This lets us make new videos with this character.
• Note that $$T^z$$ acts on $$f$$ like a spatial transformer. Here, $$f(x)$$ is not a flat encoding but the output of a convolutional step (of shape $$H' \times W' \times C'$$), so $$T^z$$ applies the same affine transformation to each of the $$C'$$ spatial maps of size $$H' \times W'$$, as in the sketch below.
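
A minimal PyTorch sketch of this setup, assuming toy stand-ins for $$f$$, $$h$$, and $$g$$, treating $$T^z$$ as given (its domain is the subject of the Variations below), and implementing the norms as mean absolute error; the warp uses `affine_grid`/`grid_sample`, which applies the same affine matrix to every channel slice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_Tz(feat, theta):
    """Spatial-transformer step: warp a [B, C', H', W'] feature map by the
    2x3 affine matrices `theta` ([B, 2, 3]); the same warp is applied to
    each of the C' spatial slices."""
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

# Toy stand-ins with the shapes from the text:
# f, h: [B, 3, H, W] -> [B, C', H', W'];  g: [B, C', H', W'] -> [B, 3, H, W]
f = nn.Conv2d(3, 16, 3, stride=2, padding=1)            # character encoder
h = nn.Conv2d(3, 16, 3, stride=2, padding=1)            # background encoder
g = nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1)   # renderer

def scene_losses(x, x_prime, theta):
    """theta is T^z for this frame pair; where it comes from is the
    subject of the Variations section."""
    z = apply_Tz(f(x), theta)                        # T^z(f(x))
    L_s = (g(z + h(x)) - x_prime).abs().mean()       # scene reconstruction
    L_c = (z - f(x_prime)).abs().mean()              # character equivariance
    L_b = (h(x_prime) - h(x)).abs().mean()           # background invariance
    return L_s, L_c, L_b

x, x_prime = torch.rand(2, 4, 3, 64, 64)              # two adjacent frames
theta = torch.eye(2, 3).unsqueeze(0).repeat(4, 1, 1)  # identity T^z, for shape
sum(scene_losses(x, x_prime, theta)).backward()
```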

# Variations

• Assuming a 2D image sequence, $$T^z$$ and $$T^x$$ are both $$2 \times 3$$ affine matrices. What is the domain of $$T^z$$?
• One variation is that it is a function of just $$T^x$$, i.e. $$T^z: \mathbb{R}^6 \rightarrow \mathbb{R}^6$$. In that case, we need to know the affine transformation applied to the character during training.
• Another variation is that it is a function of $$x$$ and $$x'$$, so $$T^z: \mathbb{R}^{H \times W \times 3 \times 2} \rightarrow \mathbb{R}^6$$. This variation allows us to learn from sequences alone and thus build a catalog of template transformations $$T^z$$ with which we can render the character from scene to scene in the same manner as it moved elsewhere.
• And a third variation is that $$T^z$$ is a function of $$f(x)$$ and $$f(x')$$, so $$T^z: \mathbb{R}^{H' \times W' \times C' \times 2} \rightarrow \mathbb{R}^6$$. This variation is arguably better founded than the previous one because $$T^z$$ does not need to relearn how to separate the character from the background; it can instead take advantage of the separation $$f$$ already has to learn.
• We ultimately want $$g(T^z(f(x), f(x'))f(x'') + h(x''))$$ to render $$x''$$ as if it were transformed by the same transformation that took $$x \rightarrow x'$$, as sketched below.
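
A sketch of this third variation, reusing the toy stand-ins for $$f$$, $$h$$, and $$g$$ from the previous sketch and adding a hypothetical `PoseRegressor` that maps a feature pair to the six affine parameters; the identity initialization is a common spatial-transformer trick, not something specified above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins as in the previous sketch.
f = nn.Conv2d(3, 16, 3, stride=2, padding=1)
h = nn.Conv2d(3, 16, 3, stride=2, padding=1)
g = nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1)

class PoseRegressor(nn.Module):
    """T^z as a function of (f(x), f(x')): maps a concatenated feature pair
    to the 6 parameters of a 2x3 affine matrix."""
    def __init__(self, c_feat=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * c_feat, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6),
        )
        # Start at the identity transform (a common spatial-transformer trick).
        self.net[-1].weight.data.zero_()
        self.net[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, z_a, z_b):
        return self.net(torch.cat([z_a, z_b], dim=1)).view(-1, 2, 3)

pose = PoseRegressor()
x, x_prime, x2 = torch.rand(3, 4, 3, 64, 64)    # x, x', and a third frame x''

theta = pose(f(x), f(x_prime))                  # T^z(f(x), f(x'))
grid = F.affine_grid(theta, f(x2).shape, align_corners=False)
z2 = F.grid_sample(f(x2), grid, align_corners=False)
rendered = g(z2 + h(x2))                        # x'' moved the way x -> x' did
```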