Low-rank Solutions of Linear Matrix Equations via Procrustes Flow

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi*, and Benjamin Recht.
UC Berkeley, *USC

$\newcommand{\T}{\mathsf{T}}$ $\newcommand{\R}{\mathbb{R}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\H}{\mathcal{H}}$ $\newcommand{\twonorm}[1]{ \left\| #1 \right\|_{\ell_2} }$ $\newcommand{\norm}[1]{ \left\| #1 \right\| }$ $\newcommand{\ip}[2]{ \left\langle #1, #2 \right\rangle}$ $\newcommand{\abs}[1]{ \left| #1 \right| }$ $\newcommand{\ind}{ \mathbf{1} }$ $\newcommand{\A}{\mathcal{A}}$ $\newcommand{\B}{\mathcal{B}}$ $\DeclareMathOperator*{\argmin}{arg\,min}$ $\newcommand{\dist}{\mathrm{dist}}$ $\newcommand{\Tr}{\mathbf{Tr}}$

Low-rank matrix recovery: Given linear operator $\A(\cdot)$ and measurements $b = \A(M) \in \R^{m}$, would like to recover $M \in \R^{n_1 \times n_2}$.

$$\min_{X \in \R^{n_1 \times n_2}} \mathrm{rank}(X) \;\;\mathrm{s.t.}\;\; \A(X) = b \:.$$

Framework is general: matrix completion, phase retrieval, metric embedding, etc.

Not convinced?

Low-rank matrix recovery is NP-hard in general.

Procrustes flow: A procedure to estimate rank-$r$ $M$ to arbitrary accuracy from $m=\widetilde{\Omega}(nr)$ samples for special types of $\A(\cdot)$'s.

(Also known as gradient descent).

Which operators work?

Restricted isometry: There exists a constant $\delta_r$ such that, for all matrices $M \in \R^{n \times n}$ with rank at most $r$,

$$ (1-\delta_r) \norm{M}^2_F \leq \twonorm{\A(M)}^2 \leq (1+\delta_r) \norm{M}^2_F \:. $$
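For a quick numerical illustration (a sketch, not part of the poster), the Python snippet below draws a Gaussian ensemble of the form $\A(M)_i = \ip{A_i}{M}/\sqrt{m}$ with i.i.d. standard normal $A_i$ and checks that $\twonorm{\A(M)}^2 / \norm{M}^2_F$ concentrates near 1 on random rank-$r$ matrices; the sizes $n, r, m$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 2
m = 10 * n * r                    # illustrative: m proportional to n*r

A_mats = rng.standard_normal((m, n, n))                     # sensing matrices A_i
A = lambda M: np.tensordot(A_mats, M, axes=2) / np.sqrt(m)  # A(M) in R^m

for _ in range(3):
    F = rng.standard_normal((n, r))
    M = F @ F.T                                             # random rank-r matrix
    ratio = np.linalg.norm(A(M))**2 / np.linalg.norm(M, 'fro')**2
    print(f"||A(M)||^2 / ||M||_F^2 = {ratio:.3f}")          # should be close to 1
```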

What operators are RIP (for constant $\delta$)?

Gaussian ensemble with $m = \Omega(nr)$.

Subsampled 2D-Fourier basis [1] with $m = \Omega(nr \log^4{n})$.

Subsampled Pauli basis [2] with $m = \Omega(nr \log^6{n})$.

[1] Oymak, Recht, and Soltanolkotabi (2015).

[2] Liu (2011).

Abridged history of RIP matrix sensing

| Algorithm | Workhorse | Convergence | Sample complexity |
| --- | --- | --- | --- |
| Nuclear norm minimization | SDP | - | $nr$ |
| Iterative hard thresholding | Gradient + rank-$r$ SVD | $\log(1/\varepsilon)$ | $nr$ |
| Alternating minimization | Least squares | $\log(1/\varepsilon)$ | $nr^2\kappa^4$ |

What if we just used gradient descent?

Low-rank recovery problem: $$ \min_{X \in \R^{n \times n}} \frac{1}{2} \twonorm{\A(X) - b}^2 \;\;\mathrm{s.t.}\;\; \mathrm{rank}(X) \leq r \:. $$

Procrustes flow: run gradient descent on the factored objective

$$ \begin{align*} \min_{U \in \R^{n \times r}} f(U) &:= \frac{1}{4} \twonorm{ \A(UU^\T) - b }^2 \:. \end{align*} $$

Instance of popular Burer-Monteiro heuristic.
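For concreteness, here is a minimal Python sketch (not from the poster) of this factored objective and its gradient for a synthetic Gaussian $\A$. Since the sensing matrices below are not symmetric, the gradient is written with a symmetrization, $\tfrac{1}{2}(\A^*(e) + \A^*(e)^\T)U$ with $e = \A(UU^\T) - b$; for symmetric sensing matrices this is exactly $\A^*(\A(UU^\T) - b)U$. A finite-difference check is included.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 2, 600               # illustrative sizes
A_mats = rng.standard_normal((m, n, n))

def A(M):                          # A(M)_i = <A_i, M> / sqrt(m)
    return np.tensordot(A_mats, M, axes=2) / np.sqrt(m)

def A_star(e):                     # adjoint: A*(e) = sum_i e_i A_i / sqrt(m)
    return np.tensordot(e, A_mats, axes=1) / np.sqrt(m)

X = rng.standard_normal((n, r))
b = A(X @ X.T)                     # noiseless measurements of M = X X^T

def f(U):
    return 0.25 * np.linalg.norm(A(U @ U.T) - b)**2

def grad_f(U):
    S = A_star(A(U @ U.T) - b)
    return 0.5 * (S + S.T) @ U     # = A*(A(UU^T) - b) U when the A_i are symmetric

# finite-difference sanity check of the gradient at a random point
U, D, eps = rng.standard_normal((n, r)), rng.standard_normal((n, r)), 1e-6
fd = (f(U + eps * D) - f(U - eps * D)) / (2 * eps)
print(fd, np.sum(grad_f(U) * D))   # the two numbers should agree closely
```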

Non-convex, but I've seen worse

(Figure: the surface of $f(U)$ in a $\R^{2 \times 1}$ case.)

A complication: $f(U) = f(U {\color{red}R})$ for any orthogonal matrix $\color{red}R$.

Measure distance up to rotation using the orthogonal Procrustes distance:

$$ \dist(U, X) := \min_{\substack{{\color{red}R} \in \R^{r \times r} \\ {\color{red}RR^\T} = {\color{red}R^\T R} = I}} \norm{U - X{\color{red}R}}_F \:. $$

$\dist(U, X) \longrightarrow 0$ implies $\norm{UU^\T - XX^\T}_F \longrightarrow 0$.
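As a small aside (an illustration, not from the poster), $\dist(U, X)$ can be evaluated in closed form: the optimal rotation is $R = WV^\T$, where $X^\T U = W \Sigma V^\T$ is an SVD, the classical orthogonal Procrustes solution.

```python
import numpy as np

def procrustes_dist(U, X):
    # optimal rotation R = W V^T from the SVD X^T U = W diag(s) V^T
    W, _, Vt = np.linalg.svd(X.T @ U)
    R = W @ Vt
    return np.linalg.norm(U - X @ R, 'fro')

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))   # a random orthogonal matrix
print(procrustes_dist(X @ Q, X))                   # ~0: XQ and X give the same UU^T
print(np.linalg.norm(X @ Q - X, 'fro'))            # the plain Frobenius distance is not
```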

Proof idea: Show strong convexity along trajectories given by the optimal Procrustes rotation.

Main theorem: ($C_1, C_2$ below are absolute constants)

Let $U_0$ satisfy

$$ \dist(U_0, X) \leq \sigma_r(M)^{1/2}/4 \:. $$
(Radius)

Iterating

$$ U_{t+1} \gets U_t - {\color{blue} \frac{C_1}{\sigma_1(U_0)^2}} \underbrace{ \color{green} \A^*( \A(U_t U_t^\T) - b ) U_t}_{\nabla f(U_t)} \:, $$
(Update)

achieves

$$ \definecolor{rate}{rgb}{0.59, 0.29, 0.0} \dist^2(U_\tau, X) \leq \left( 1 - \frac{C_2}{\color{rate} \kappa} \right)^{\tau} \dist^2(U_0, X) \:. $$
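The sketch below (an illustration, not from the poster) runs this update on a synthetic Gaussian instance. The starting point is a small perturbation of $X$ so that the (Radius) condition holds, and the step-size constant $0.25$ is an illustrative stand-in for $C_1$, not the constant from the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 2, 600               # illustrative sizes
A_mats = rng.standard_normal((m, n, n))
A      = lambda M: np.tensordot(A_mats, M, axes=2) / np.sqrt(m)
A_star = lambda e: np.tensordot(e, A_mats, axes=1) / np.sqrt(m)

X = rng.standard_normal((n, r))
b = A(X @ X.T)

def dist(U, X):
    W, _, Vt = np.linalg.svd(X.T @ U)
    return np.linalg.norm(U - X @ (W @ Vt), 'fro')

# start inside the basin: a small perturbation of X relative to sigma_r(X)
sig_r = np.linalg.svd(X, compute_uv=False)[-1]
U = X + 0.1 * sig_r * rng.standard_normal((n, r)) / np.sqrt(n * r)

eta = 0.25 / np.linalg.norm(U, 2)**2      # step ~ C_1 / sigma_1(U_0)^2
for t in range(201):
    S = A_star(A(U @ U.T) - b)
    U = U - eta * 0.5 * (S + S.T) @ U     # = A*(A(UU^T) - b) U for symmetric A_i
    if t % 50 == 0:
        print(t, dist(U, X))              # should decay geometrically, down to machine precision
```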

What about initialization?

Use $\log(r^{1/2} \kappa)$ iterations of hard-thresholding.

Can use approx SVD (see Hegde, Indyk, and Schmidt (2016)).
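One way to realize such an initialization (a rough sketch under the same synthetic Gaussian setup as before; the constants and iteration count are illustrative, not the exact procedure of the paper) is to run SVP-style hard-thresholding steps $Z \gets \mathcal{P}_r(Z - \A^*(\A(Z) - b))$, where $\mathcal{P}_r$ denotes the best rank-$r$ approximation, and then factor the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 2, 600
A_mats = rng.standard_normal((m, n, n))
A      = lambda M: np.tensordot(A_mats, M, axes=2) / np.sqrt(m)
A_star = lambda e: np.tensordot(e, A_mats, axes=1) / np.sqrt(m)

X = rng.standard_normal((n, r))
b = A(X @ X.T)

def rank_r_approx(Z, r):               # P_r: best rank-r approximation via truncated SVD
    W, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (W[:, :r] * s[:r]) @ Vt[:r]

Z = np.zeros((n, n))
for _ in range(15):                    # hard-thresholding iterations
    Z = rank_r_approx(Z - A_star(A(Z) - b), r)

# factor the (symmetrized) iterate to get U_0 with U_0 U_0^T ~ Z
W, s, _ = np.linalg.svd(0.5 * (Z + Z.T))
U0 = W[:, :r] * np.sqrt(s[:r])
rel_err = np.linalg.norm(U0 @ U0.T - X @ X.T, 'fro') / np.linalg.norm(X @ X.T, 'fro')
print(rel_err)                         # should be small, i.e. U_0 lands inside the basin
```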

You don't even need to! All local min are global min, and all saddle points have negative curvature.

Matrix perturbation insights

Working with the factorized $U$ requires converting from $\dist(U, X)$ to $\norm{UU^\T - XX^\T}_F$ and vice versa.

(1) $\dist(U, X)$ small $\Longrightarrow$ $\norm{UU^\T - XX^\T}_F$ small is easy.

(2) $\norm{UU^\T - XX^\T}_F$ small $\Longrightarrow$ $\dist(U, X)$ small requires more work.

(Standard matrix perturbation arguments are pessimistic by a factor of the condition number.)

We establish that

$$ \dist^2(U, X) \lesssim {\color{green} \frac{1}{\sigma_r(X)^2}} \norm{UU^\T - XX^\T}^2_F \:. $$

(This inequality has come up in many extensions of our work, e.g., Bhojanapalli et al.)
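A quick numerical look at the symmetric version of this bound (an illustrative sketch, not from the poster): for random factors $U$ near $X$, $\dist^2(U, X)$ stays within an absolute constant of $\norm{UU^\T - XX^\T}^2_F / \sigma_r(X)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 3

def dist(U, X):
    W, _, Vt = np.linalg.svd(X.T @ U)
    return np.linalg.norm(U - X @ (W @ Vt), 'fro')

X = rng.standard_normal((n, r))
sig_r = np.linalg.svd(X, compute_uv=False)[-1]

for scale in [1e-3, 1e-2, 1e-1]:
    U = X + scale * rng.standard_normal((n, r))
    lhs = dist(U, X)**2
    rhs = np.linalg.norm(U @ U.T - X @ X.T, 'fro')**2 / sig_r**2
    print(f"perturbation {scale:.0e}: dist^2 = {lhs:.3e}, RHS quantity = {rhs:.3e}")
```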

For $M_\ell = U_\ell \Sigma_\ell V_\ell^\T$, $\ell=1,2$,

$$ \dist^2\left(\begin{bmatrix} U_2\Sigma_2^{1/2} \\ V_2\Sigma_2^{1/2} \end{bmatrix}, \begin{bmatrix} U_1\Sigma_1^{1/2} \\ V_1\Sigma_1^{1/2} \end{bmatrix}\right) \lesssim {\color{green} \frac{1}{\sigma_r(M_1)}} \norm{M_2 - M_1}^2_F \:. $$

Conclusion

Gradient descent for RIP matrix sensing converges linearly to the unknown matrix $M$.

Sample complexity of $\Omega(nr)$ for the Gaussian ensemble, and $\widetilde{\Omega}(nr)$ for other structured ensembles.

Poster session Wed 10am-1pm.