Can we use equivariance to disentangle components of a video?

Background

  • As explored in ICML 2020, an equivariant renderer \(g\) is one where \begin{align}z &= f(x)\\g(z) &= x\\g(T^z(z)) = g(T^z(f(x))) &= T^x(g(z)) = T^x(x)\end{align}
  • It learns by training \(g, f\) to minimize \(\|g(T^z(f(x))) - T^x(x)\|\), where \(T^x\) is a known rotation of the input and \(T^z\) is the same rotation applied in \(z\) space (a minimal training sketch follows this list). Afterwards, \(g\) can render unknown viewpoints simply by transforming \(f(x)\).
  • Note that the parts of a video can be placed into two sets: C and B. Here, C is the set of moving characters and B is the static background (everything else).
  • Further, at an appropriate level of granularity, the change in a character’s position between any sufficiently close frames \(x_t\) and \(x_{t+1}\) can be modeled as an affine transformation \(T\). This model of the movement is of course prone to mistakes, but it is a reasonable assumption that should hold better as the frames per second increase.
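
A minimal PyTorch-style sketch of the objective above, where x is a batch of images, x_rot is the same batch under the known rotation \(T^x\), and rot_params describes that rotation. The names encoder, renderer, and rotate_latent are placeholders standing in for \(f\), \(g\), and \(T^z\), not names from the ICML 2020 paper:

```python
import torch.nn.functional as F

def equivariance_loss(encoder, renderer, rotate_latent, x, x_rot, rot_params):
    """|| g(T^z(f(x))) - T^x(x) || for one batch of (image, rotated image) pairs."""
    z = encoder(x)                          # z = f(x)
    z_rot = rotate_latent(z, rot_params)    # T^z(z): the same rotation, applied in latent space
    x_hat = renderer(z_rot)                 # g(T^z(f(x)))
    return F.l1_loss(x_hat, x_rot)          # compare against T^x(x)
```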

Hypothesis

  • From this model, we hypothesize that we can disentangle a scene into characters and background by training a renderer to be equivariant to affine transformations of a character and invariant to transformations of the background.
  • In other words, assuming a single moving character, we train \(g\), \(f\), and \(h\) s.t. \begin{align}g(T^z(f(x)) + h(x)) &= T^x(x) = x'\end{align}
  • To do this, we utilize one main scene loss \(L_s\) along with two supplementary losses \(L_c\) and \(L_b\) representing the character and background respectively (a PyTorch-style sketch of these losses follows this list):\begin{align}L_s &= |g(T^z(f(x)) + h(x)) - x'|\\L_c &= |T^z(f(x)) - f(x')|\\L_b &= |h(x') - h(x)|\end{align}
  • The domains and ranges are as follows:
    • \(f\): image of shape \([H, W, 3] \rightarrow\) feature of shape \([H', W', C']\)
    • \(h\): image of shape \([H, W, 3] \rightarrow\) feature of shape \([H', W', C']\)
    • \(T^z\): <see variations below> \(\rightarrow \mathbb{R}^6\)
    • \(g\): feature of shape \([H', W', C'] \rightarrow\) image of shape \([H, W, 3]\)
  • Because \(f(x)\) is trained (via \(L_c\)) to vary across frames in step with the character’s motion, it should account for the varying parts of the sequence (the character); because \(h(x)\) is penalized (via \(L_b\)) for changing between frames, it should account for the non-varying parts, the background. This has hints of slow feature analysis.
  • Afterward, we can render the character under any affine transformation on any of the trained backgrounds by combining a background code \(h(x^k)\) with an affinely transformed character code \(T^z(f(x^j))\). This lets us make new videos with this character.
  • Note that \(T^z\) acts on \(f(x)\) like a spatial transformer. Here, \(f(x)\) is not a flat encoding but the output of a convolutional step (so of shape \(H' \times W' \times C'\)), and \(T^z\) applies the same affine transformation to the \(H' \times W'\) grid of every channel, as sketched below.
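
A minimal PyTorch-style sketch of these pieces, assuming channels-first feature maps of shape \([N, C', H', W']\). Here f_char, h_bg, and g_render are hypothetical modules standing in for \(f\), \(h\), and \(g\), and theta_z is the \(2 \times 3\) affine matrix produced by \(T^z\) (however it is parameterized; see Variations). The spatial-transformer application of \(T^z\) uses affine_grid / grid_sample:

```python
import torch.nn.functional as F

def apply_affine_to_features(feat, theta):
    """Apply a [N, 2, 3] affine matrix to a [N, C', H', W'] feature map,
    i.e. T^z acting on f(x) like a spatial transformer (same warp for every channel)."""
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

def scene_losses(f_char, h_bg, g_render, x, x_next, theta_z):
    """Compute L_s, L_c, L_b for one pair of consecutive frames (x, x' = x_next)."""
    z_char, z_char_next = f_char(x), f_char(x_next)   # f(x), f(x')
    z_bg, z_bg_next = h_bg(x), h_bg(x_next)           # h(x), h(x')

    z_char_t = apply_affine_to_features(z_char, theta_z)   # T^z(f(x))

    L_s = F.l1_loss(g_render(z_char_t + z_bg), x_next)     # scene: render should match x'
    L_c = F.l1_loss(z_char_t, z_char_next)                  # character: T^z(f(x)) ≈ f(x')
    L_b = F.l1_loss(z_bg_next, z_bg)                        # background: h(x') ≈ h(x)
    return L_s, L_c, L_b
```

After training, the same pieces support the remixing described above: e.g. g_render(apply_affine_to_features(f_char(x_j), theta_new) + h_bg(x_k)) renders the character from frame \(x^j\), moved by a chosen theta_new, onto the background of frame \(x^k\).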

Variations

  • Assuming a 2d image sequence, \(T^z\) and \(T^x\) are both \(2 \times 3\) affine matrices. What is the domain of \(T^z\)?
    • One variation is that it is a function of just \(T^x\), i.e. \(T^z: \mathbb{R}^6 \rightarrow \mathbb{R}^6\). In that case, we need to know the affine transformation applied to the character during training.
    • Another variation is that it is a function of \(x\) and \(x'\), so \(T^z: \mathbb{R}^{H \times W \times 3 \times 2} \rightarrow \mathbb{R}^6\). This variation allows us to learn from sequences alone and thus build a catalog of template transformations \(T^z\) with which we can render the character from scene to scene, moving in a similar manner as it moved elsewhere.
    • And a third variation is that \(T^z\) is a function of \(f(x)\) and \(f(x')\), so \(T^z: \mathbb{R}^{H' \times W' \times C' \times 2} \rightarrow \mathbb{R}^6\). This variation is arguably better founded than the previous one because \(T^z\) does not need to relearn how to separate \(x\) from \(x'\); it can instead take advantage of what \(f\) already has to learn.
    • We ultimately want \(g(T^z(f(x), f(x'))\,f(x'') + h(x''))\) to render \(x''\) as if it were transformed by the same transformation that took \(x \rightarrow x'\) (see the sketch below).
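
A sketch of this third variation, assuming the same hypothetical modules as in the earlier sketches. TzNet is a made-up network that maps the pair \((f(x), f(x'))\) to the 6 affine parameters, initialized to predict the identity transform so that early in training it leaves \(f(x'')\) unchanged:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_affine_to_features(feat, theta):
    """Same helper as in the Hypothesis sketch: warp a [N, C', H', W'] map by a [N, 2, 3] matrix."""
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

class TzNet(nn.Module):
    """Variation 3: predict a 2x3 affine matrix from a pair of character feature maps."""
    def __init__(self, c_feat, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * c_feat, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(hidden, 6),
        )
        # Initialize the final layer to output the identity transform [1, 0, 0, 0, 1, 0].
        nn.init.zeros_(self.net[-1].weight)
        with torch.no_grad():
            self.net[-1].bias.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, z, z_next):
        theta = self.net(torch.cat([z, z_next], dim=1))   # [N, 6]
        return theta.view(-1, 2, 3)                        # 2x3 affine matrix

def transfer_render(f_char, h_bg, g_render, tz_net, x, x_next, x_new):
    """Render x'' as if it underwent the transformation that took x to x':
    g(T^z(f(x), f(x')) f(x'') + h(x''))."""
    theta = tz_net(f_char(x), f_char(x_next))                  # T^z(f(x), f(x'))
    z_new_t = apply_affine_to_features(f_char(x_new), theta)   # apply it to f(x'')
    return g_render(z_new_t + h_bg(x_new))
```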