Low-rank Solutions of Linear Matrix Equations via Procrustes Flow

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi*, and Benjamin Recht.
UC Berkeley, *USC

$\newcommand{\T}{\mathsf{T}}$ $\newcommand{\R}{\mathbb{R}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\H}{\mathcal{H}}$ $\newcommand{\twonorm}[1]{ \left\| #1 \right\|_{\ell_2} }$ $\newcommand{\norm}[1]{ \left\| #1 \right\| }$ $\newcommand{\ip}[2]{ \left\langle #1, #2 \right\rangle}$ $\newcommand{\abs}[1]{ \left| #1 \right| }$ $\newcommand{\ind}{ \mathbf{1} }$ $\newcommand{\A}{\mathcal{A}}$ $\newcommand{\B}{\mathcal{B}}$ $\DeclareMathOperator*{\argmin}{arg\,min}$ $\newcommand{\dist}{\mathrm{dist}}$ $\newcommand{\Tr}{\mathbf{Tr}}$

Low-rank matrix recovery: Given linear operator $\A(\cdot)$ and measurements $b = \A(M) \in \R^{m}$, would like to recover $M \in \R^{n_1 \times n_2}$.

$$\min_{X \in \R^{n_1 \times n_2}} \mathrm{rank}(X) \;\;\mathrm{s.t.}\;\; \A(X) = b \:.$$

Framework is general: matrix completion, phase retrieval, metric embedding, etc.

Not convinced?

Low-rank matrix recovery is NP-hard in general.

Procrustes flow: A procedure to estimate rank-$r$ $M$ to arbitrary accuracy from $m=\widetilde{\Omega}(nr)$ samples for special types of $\A(\cdot)$'s.

(Also known as gradient descent).

Which operators work?

Restricted isometry: There exists a constant $\delta_r$ such that, for all matrices $M \in \R^{n \times n}$ with rank at most $r$,

$$ (1-\delta_r) \norm{M}^2_F \leq \twonorm{\A(M)}^2 \leq (1+\delta_r) \norm{M}^2_F \:. $$
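For a quick numerical illustration (a sketch, not part of the poster), the Python snippet below draws a Gaussian ensemble of the form $\A(M)_i = \ip{A_i}{M}/\sqrt{m}$ with i.i.d. standard normal $A_i$ and checks that $\twonorm{\A(M)}^2 / \norm{M}^2_F$ concentrates near 1 on random rank-$r$ matrices; the sizes $n, r, m$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 2
m = 10 * n * r                    # illustrative: m proportional to n*r

A_mats = rng.standard_normal((m, n, n))                     # sensing matrices A_i
A = lambda M: np.tensordot(A_mats, M, axes=2) / np.sqrt(m)  # A(M) in R^m

for _ in range(3):
    F = rng.standard_normal((n, r))
    M = F @ F.T                                             # random rank-r matrix
    ratio = np.linalg.norm(A(M))**2 / np.linalg.norm(M, 'fro')**2
    print(f"||A(M)||^2 / ||M||_F^2 = {ratio:.3f}")          # should be close to 1
```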

What operators are RIP (for constant $\delta$)?

Gaussian ensemble with $m = \Omega(nr)$.

Subsampled 2D-Fourier basis [1] with $m = \Omega(nr \log^4{n})$.

Subsampled Pauli basis [2] with $m = \Omega(nr \log^6{n})$.

[1] Oymak, Recht, and Soltanolkotabi (2015).

[2] Liu (2011).

Abridged history of RIP matrix sensing

| Algorithm | Workhorse | Convergence | Sample complexity |
| --- | --- | --- | --- |
| Nuclear norm minimization | SDP | - | $nr$ |
| Iterative hard thresholding | Gradient + rank-$r$ SVD | $\log(1/\varepsilon)$ | $nr$ |
| Alternating minimization | Least squares | $\log(1/\varepsilon)$ | $nr^2\kappa^4$ |

What if we just used gradient descent?

Low-rank recovery problem: $$ \min_{X \in \R^{n \times n}} \frac{1}{2} \twonorm{\A(X) - b}^2 \;\;\mathrm{s.t.}\;\; \mathrm{rank}(X) \leq r \:. $$

Procrustes flow: run gradient descent on the factored objective

$$ \begin{align*} \min_{U \in \R^{n \times r}} f(U) &:= \frac{1}{4} \twonorm{ \A(UU^\T) - b }^2 \:. \end{align*} $$

Instance of popular Burer-Monteiro heuristic.
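For concreteness, here is a minimal Python sketch (not from the poster) of this factored objective and its gradient for a synthetic Gaussian $\A$. Since the sensing matrices below are not symmetric, the gradient is written with a symmetrization, $\tfrac{1}{2}(\A^*(e) + \A^*(e)^\T)U$ with $e = \A(UU^\T) - b$; for symmetric sensing matrices this is exactly $\A^*(\A(UU^\T) - b)U$. A finite-difference check is included.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 2, 600               # illustrative sizes
A_mats = rng.standard_normal((m, n, n))

def A(M):                          # A(M)_i = <A_i, M> / sqrt(m)
    return np.tensordot(A_mats, M, axes=2) / np.sqrt(m)

def A_star(e):                     # adjoint: A*(e) = sum_i e_i A_i / sqrt(m)
    return np.tensordot(e, A_mats, axes=1) / np.sqrt(m)

X = rng.standard_normal((n, r))
b = A(X @ X.T)                     # noiseless measurements of M = X X^T

def f(U):
    return 0.25 * np.linalg.norm(A(U @ U.T) - b)**2

def grad_f(U):
    S = A_star(A(U @ U.T) - b)
    return 0.5 * (S + S.T) @ U     # = A*(A(UU^T) - b) U when the A_i are symmetric

# finite-difference sanity check of the gradient at a random point
U, D, eps = rng.standard_normal((n, r)), rng.standard_normal((n, r)), 1e-6
fd = (f(U + eps * D) - f(U - eps * D)) / (2 * eps)
print(fd, np.sum(grad_f(U) * D))   # the two numbers should agree closely
```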

Non-convex, but I've seen worse

(Figure: the surface of $f(U)$ in a $\R^{2 \times 1}$ case.)

A complication: $f(U) = f(U {\color{red}R})$ for any orthogonal matrix $\color{red}R$.

Measure distance up to rotation using the orthogonal Procrustes distance:

$$ \dist(U, X) := \min_{\substack{{\color{red}R} \in \R^{r \times r} \\ {\color{red}RR^\T} = {\color{red}R^\T R} = I}} \norm{U - X{\color{red}R}}_F \:. $$

$\dist(U, X) \longrightarrow 0$ implies $\norm{UU^\T - XX^\T}_F \longrightarrow 0$.
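As a small aside (an illustration, not from the poster), $\dist(U, X)$ can be evaluated in closed form: the optimal rotation is $R = WV^\T$, where $X^\T U = W \Sigma V^\T$ is an SVD, the classical orthogonal Procrustes solution.

```python
import numpy as np

def procrustes_dist(U, X):
    # optimal rotation R = W V^T from the SVD X^T U = W diag(s) V^T
    W, _, Vt = np.linalg.svd(X.T @ U)
    R = W @ Vt
    return np.linalg.norm(U - X @ R, 'fro')

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))   # a random orthogonal matrix
print(procrustes_dist(X @ Q, X))                   # ~0: XQ and X give the same UU^T
print(np.linalg.norm(X @ Q - X, 'fro'))            # the plain Frobenius distance is not
```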

Proof idea: Show strong convexity along trajectories given by the optimal Procrustes rotation.

Main theorem: ($C_1, C_2$ below are absolute constants)

Let $U_0$ satisfy

$$ \dist(U_0, X) \leq \sigma_r(M)^{1/2}/4 \:. $$
(Radius)

Iterating

$$ U_{t+1} \gets U_t - {\color{blue} \frac{C_1}{\sigma_1(U_0)^2}} \underbrace{ \color{green} \A^*( \A(U_t U_t^\T) - b ) U_t}_{\nabla f(U_t)} \:, $$
(Update)

achieves

$$ \definecolor{rate}{rgb}{0.59, 0.29, 0.0} \dist^2(U_\tau, X) \leq \left( 1 - \frac{C_2}{\color{rate} \kappa} \right)^{\tau} \dist^2(U_0, X) \:. $$
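The sketch below (an illustration, not from the poster) runs this update on a synthetic Gaussian instance. The starting point is a small perturbation of $X$ so that the (Radius) condition holds, and the step-size constant $0.25$ is an illustrative stand-in for $C_1$, not the constant from the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 2, 600               # illustrative sizes
A_mats = rng.standard_normal((m, n, n))
A      = lambda M: np.tensordot(A_mats, M, axes=2) / np.sqrt(m)
A_star = lambda e: np.tensordot(e, A_mats, axes=1) / np.sqrt(m)

X = rng.standard_normal((n, r))
b = A(X @ X.T)

def dist(U, X):
    W, _, Vt = np.linalg.svd(X.T @ U)
    return np.linalg.norm(U - X @ (W @ Vt), 'fro')

# start inside the basin: a small perturbation of X relative to sigma_r(X)
sig_r = np.linalg.svd(X, compute_uv=False)[-1]
U = X + 0.1 * sig_r * rng.standard_normal((n, r)) / np.sqrt(n * r)

eta = 0.25 / np.linalg.norm(U, 2)**2      # step ~ C_1 / sigma_1(U_0)^2
for t in range(201):
    S = A_star(A(U @ U.T) - b)
    U = U - eta * 0.5 * (S + S.T) @ U     # = A*(A(UU^T) - b) U for symmetric A_i
    if t % 50 == 0:
        print(t, dist(U, X))              # should decay geometrically, down to machine precision
```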

What about initialization?

Use $\log(r^{1/2} \kappa)$ iterations of hard-thresholding.

Can use approx SVD (see Hegde, Indyk, and Schmidt (2016)).
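One way to realize such an initialization (a rough sketch under the same synthetic Gaussian setup as before; the constants and iteration count are illustrative, not the exact procedure of the paper) is to run SVP-style hard-thresholding steps $Z \gets \mathcal{P}_r(Z - \A^*(\A(Z) - b))$, where $\mathcal{P}_r$ denotes the best rank-$r$ approximation, and then factor the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 2, 600
A_mats = rng.standard_normal((m, n, n))
A      = lambda M: np.tensordot(A_mats, M, axes=2) / np.sqrt(m)
A_star = lambda e: np.tensordot(e, A_mats, axes=1) / np.sqrt(m)

X = rng.standard_normal((n, r))
b = A(X @ X.T)

def rank_r_approx(Z, r):               # P_r: best rank-r approximation via truncated SVD
    W, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (W[:, :r] * s[:r]) @ Vt[:r]

Z = np.zeros((n, n))
for _ in range(15):                    # hard-thresholding iterations
    Z = rank_r_approx(Z - A_star(A(Z) - b), r)

# factor the (symmetrized) iterate to get U_0 with U_0 U_0^T ~ Z
W, s, _ = np.linalg.svd(0.5 * (Z + Z.T))
U0 = W[:, :r] * np.sqrt(s[:r])
rel_err = np.linalg.norm(U0 @ U0.T - X @ X.T, 'fro') / np.linalg.norm(X @ X.T, 'fro')
print(rel_err)                         # should be small, i.e. U_0 lands inside the basin
```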

You don't even need to! All local min are global min, and all saddle points have negative curvature.

Matrix perturbation insights

Working with the factorized $U$ requires converting from $\dist(U, X)$ to $\norm{UU^\T - XX^\T}_F$ and vice versa.

(1) $\dist(U, X)$ small $\Longrightarrow$ $\norm{UU^\T - XX^\T}_F$ small is easy.

(2) $\norm{UU^\T - XX^\T}_F$ small $\Longrightarrow$ $\dist(U, X)$ small requires more work.

(Standard matrix perturbation arguments are pessimistic by a factor of the condition number.)

We establish that

$$ \dist^2(U, X) \lesssim {\color{green} \frac{1}{\sigma_r(X)^2}} \norm{UU^\T - XX^\T}^2_F \:. $$

(This inequality has come up in many extensions of our work, e.g., Bhojanapalli et al.)
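A quick numerical look at the symmetric version of this bound (an illustrative sketch, not from the poster): for random factors $U$ near $X$, $\dist^2(U, X)$ stays within an absolute constant of $\norm{UU^\T - XX^\T}^2_F / \sigma_r(X)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 3

def dist(U, X):
    W, _, Vt = np.linalg.svd(X.T @ U)
    return np.linalg.norm(U - X @ (W @ Vt), 'fro')

X = rng.standard_normal((n, r))
sig_r = np.linalg.svd(X, compute_uv=False)[-1]

for scale in [1e-3, 1e-2, 1e-1]:
    U = X + scale * rng.standard_normal((n, r))
    lhs = dist(U, X)**2
    rhs = np.linalg.norm(U @ U.T - X @ X.T, 'fro')**2 / sig_r**2
    print(f"perturbation {scale:.0e}: dist^2 = {lhs:.3e}, RHS quantity = {rhs:.3e}")
```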

For $M_\ell = U_\ell \Sigma_\ell V_\ell^\T$, $\ell=1,2$,

$$ \dist^2\left(\begin{bmatrix} U_2\Sigma_2^{1/2} \\ V_2\Sigma_2^{1/2} \end{bmatrix}, \begin{bmatrix} U_1\Sigma_1^{1/2} \\ V_1\Sigma_1^{1/2} \end{bmatrix}\right) \lesssim {\color{green} \frac{1}{\sigma_r(M_1)}} \norm{M_2 - M_1}^2_F \:. $$

Conclusion

Gradient descent for RIP matrix sensing converges linearly to the unknown matrix $M$.

Sample complexity of $\Omega(nr)$ for the Gaussian ensemble, and $\widetilde{\Omega}(nr)$ for other structured ensembles.

Poster session Wed 10am-1pm.