**Stephen Tu**, Shivaram Venkataraman, Ashia C. Wilson,

Alex Gittens, Michael I. Jordan, and Benjamin Recht.

$\newcommand{\T}{\mathsf{T}}$ $\newcommand{\R}{\mathbb{R}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\H}{\mathcal{H}}$ $\newcommand{\twonorm}[1]{ \left\| #1 \right\|_{\ell_2} }$ $\newcommand{\fronorm}[1]{ \left\| #1 \right\|_{F} }$ $\newcommand{\norm}[1]{ \left\| #1 \right\| }$ $\newcommand{\ip}[2]{ \left\langle #1, #2 \right\rangle}$ $\newcommand{\abs}[1]{ \left| #1 \right| }$ $\newcommand{\ind}{ \mathbf{1} }$ $\newcommand{\A}{\mathcal{A}}$ $\newcommand{\B}{\mathcal{B}}$ $\DeclareMathOperator*{\argmin}{arg\,min}$ $\newcommand{\dist}{\mathrm{dist}}$ $\newcommand{\Tr}{\mathbf{Tr}}$

**Simple Goal:** Fix an $n \times n$ (dense) positive definite matrix $A$ and
response vector $b \in \R^n$. Our goal is to solve $Ax=b$.

**Caveat:** we want $n$ to be really big.

Motivating application is kernel ridge regression (KRR).

**KRR:**
Given dataset $\{(x_i, y_i)\}_{i=1}^{n}$ and kernel function $k(x, y)$,
the goal is to solve
$$
\min_{\alpha \in \R^n} \twonorm{ K\alpha - Y }^2 + \lambda \alpha^\T K \alpha \:.
$$

The minimizer solves $(K + \lambda I) \alpha = Y$. For large $n$ (e.g. $n \approx 10^6$), $K$ does not even fit in memory.
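A small-scale sketch of the direct KRR solve (the synthetic data and the choice of a Gaussian kernel are illustrative assumptions, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 200, 1e-2
X = rng.standard_normal((n, 3))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Gaussian kernel matrix: K_ij = k(x_i, x_j) = exp(-||x_i - x_j||^2 / 2)
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2)

# The KRR optimality condition reduces to (K + lam * I) alpha = Y.
# This dense solve costs O(n^3) time and O(n^2) memory -- hopeless at n ~ 10^6.
alpha = np.linalg.solve(K + lam * np.eye(n), Y)
```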

If the matrices do not fit in memory, we must turn to iterative methods.

Iterative methods for linear systems are a widely studied area.
Classical methods include *conjugate gradient* (CG)
and *Gauss-Seidel* (GS).

**This talk:** A new accelerated variant of randomized GS, designed to work with small blocks of data at a time.

Given a current iterate $x_k$ and a random block of coordinates $I \subseteq [n]$, update only the coordinates in $I$ via $$ (x_{k+1})_{I} = (x_k)_{I} - A_{II}^{-1} (A x_k - b)_{I} \:. $$

Equivalently, with sketching matrix $S_k \in \R^{n \times \abs{I}}$, $$ x_{k+1} = x_k - S_k (S_k^\T A S_k)^{-1} S_k^\T (A x_k - b) \:. $$
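A minimal numpy transcription of the coordinate-form update (function and variable names are mine, not from the talk):

```python
import numpy as np

def block_gs_step(A, b, x, block):
    """One randomized block Gauss-Seidel update on the coordinates in `block`.

    Implements (x_{k+1})_I = (x_k)_I - A_{II}^{-1} (A x_k - b)_I:
    the |block| x |block| subsystem is solved exactly, and all other
    coordinates of x are left untouched.
    """
    r = A @ x - b                      # full residual A x_k - b
    x = x.copy()
    x[block] -= np.linalg.solve(A[np.ix_(block, block)], r[block])
    return x
```

Note that only the rows of $A$ indexed by the block are actually needed to form $r_{I}$ and $A_{II}$, which is what makes the method attractive when $A$ cannot be held in memory.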

**Question:** How should we sample blocks?

We focus on two practical distributions for block GS.

**Fixed partition:** Choose a block size $p$, divide $[n]$ into blocks
$I_1, ..., I_{n/p}$ *ahead of time*. During the iterations, choose a block $I_{k_t}$
uniformly at random, $k_t \sim \mathrm{Unif}(\{1, ..., n/p\})$.

**Random coordinates:**
At each iteration, choose uniformly from the set
$\{ I \in 2^{[n]} : \abs{I} = p \}$.

Fixed partition is preferable from a systems perspective (better cache locality, can replicate blocks, etc.).

Random coordinate sampling suffers from slower memory reads. So should we just use fixed partitioning?

A simple example where the sampling matters in block GS (similar to Lee and Wright, 16).

Define the $n \times n$ matrix $$ A_{\alpha,\beta} := \alpha I + \frac{\beta}{n} \ind\ind^\T \:. $$

Now, let us try $n=5000, p=500, \alpha=1, \beta=1000$.
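The effect already shows up at a smaller scale. A sketch of the comparison (with $n=100$, $p=10$ instead of $5000/500$, so the dense solves stay cheap):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha, beta = 100, 10, 1.0, 1000.0
A = alpha * np.eye(n) + (beta / n) * np.ones((n, n))   # A_{alpha,beta}
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

def run(sample_block, iters=200):
    """Run block GS drawing blocks via `sample_block`; return final error."""
    x = np.zeros(n)
    for _ in range(iters):
        I = sample_block()
        x[I] -= np.linalg.solve(A[np.ix_(I, I)], (A @ x - b)[I])
    return np.linalg.norm(x - x_star)

partition = np.arange(n).reshape(n // p, p)            # blocks fixed ahead of time
err_part = run(lambda: partition[rng.integers(n // p)])
err_rand = run(lambda: rng.choice(n, size=p, replace=False))
# Random coordinates converge dramatically faster on this matrix.
```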

To understand what is going on, let us look at the theory of randomized GS.

(Gower and Richtárik, 16). For all $k \geq 0$, $$ \begin{align*} \E\norm{x_k - x_*}_{A} \leq (1 - \mu)^{k/2} D_0 \:, \end{align*} $$

where $\mu := \lambda_{\min}(\E[ P_{A^{1/2} S} ])$.

The $\mu$ quantity varies as the sampling distribution changes.

$$ A_{\alpha,\beta} := \alpha I + \frac{\beta}{n} \ind\ind^\T \:. $$

$$ \begin{align*} \mu_{\mathrm{part}} &= \frac{p}{n+\beta p} \:, \\ \mu_{\mathrm{rand}} &= \mu_{\mathrm{part}} + \frac{p-1}{n-1} \frac{\beta p}{n + \beta p} \:. \end{align*} $$
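These closed forms can be checked numerically at small scale by enumerating all blocks and computing $\lambda_{\min}(\E[P_{A^{1/2}S}])$ directly (the particular values of $n$, $p$, $\beta$ below are arbitrary):

```python
import numpy as np
from itertools import combinations

n, p, beta = 6, 2, 7.0
A = np.eye(n) + (beta / n) * np.ones((n, n))           # A_{1,beta}
w, V = np.linalg.eigh(A)
Asqrt = V @ np.diag(np.sqrt(w)) @ V.T                  # A^{1/2}

def mu(blocks):
    """lambda_min(E[P_{A^{1/2} S}]) for S uniform over the given blocks."""
    P = np.zeros((n, n))
    for I in blocks:
        B = Asqrt[:, list(I)]                          # A^{1/2} S
        P += B @ np.linalg.solve(B.T @ B, B.T)         # projection onto range(A^{1/2} S)
    return np.linalg.eigvalsh(P / len(blocks))[0]

mu_part = mu([(0, 1), (2, 3), (4, 5)])                 # one fixed partition
mu_rand = mu(list(combinations(range(n), p)))          # all size-p subsets
```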

As $\beta \longrightarrow \infty$,
$\mu_{\mathrm{part}} \approx 1/\beta$ whereas
$\mu_{\mathrm{rand}} \approx p/n$.
**Gap is arbitrarily large.**

**Systems point of view:** fixed-partition sampling is preferable.
Can cache blocks ahead of time, replicate across nodes, etc.
*Locality is good for performance!*

**Optimization point of view:** random coordinates is preferable. Each iteration of GS makes more progress.
*Locality is bad for optimization!*

**Q1:** Does the same sampling phenomenon occur with acceleration?

**Q2:** Does acceleration provide the $\sqrt{\mu}$ behavior we expect?

Add a Nesterov momentum step to the iterates.

$$ \begin{align*} x_{k+1} &= \frac{1}{1+\tau} y_k + \frac{\tau}{1+\tau} z_k \:, \\ H_k &= S_k(S_k^\T A S_k)^{-1} S_k^\T \:, \\ y_{k+1} &= x_{k+1} - H_k(A x_{k+1} - b) \:, \\ z_{k+1} &= z_k + \tau(x_{k+1} - z_k) - \frac{\tau}{\mu} H_k(A x_{k+1} - b) \:. \end{align*} $$
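A direct transcription of these updates, restricted to selection-matrix sketches (the function signature is mine; $\mu$ and $\nu$ are supplied by the caller, e.g. the closed-form $\mu_{\mathrm{part}}$ with $\nu = n/p$ for fixed-partition sampling):

```python
import numpy as np

def accelerated_block_gs(A, b, blocks, mu, nu, iters, rng):
    """Accelerated block Gauss-Seidel with tau = sqrt(mu / nu); returns y_k."""
    n = A.shape[0]
    tau = np.sqrt(mu / nu)
    x = y = z = np.zeros(n)
    for _ in range(iters):
        I = blocks[rng.integers(len(blocks))]
        x = (y + tau * z) / (1 + tau)                  # x_{k+1}
        h = np.zeros(n)                                # h = H_k (A x_{k+1} - b):
        h[I] = np.linalg.solve(A[np.ix_(I, I)], (A @ x - b)[I])
        y = x - h                                      # y_{k+1}
        z = z + tau * (x - z) - (tau / mu) * h         # z_{k+1}
    return y
```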

Parameters $\tau, \mu$ are carefully chosen.

(Nesterov and Stich, 16). For all $k \geq 0$, accelerated block GS with fixed-partition sampling satisfies $$ \E\norm{y_k - x_*}_A \lesssim \left(1 - \sqrt{\frac{p}{n} \mu_{\mathrm{part}}} \right)^{k/2} D_0 \:. $$

Here, $\mu_{\mathrm{part}} = \lambda_{\min}(\E[ P_{A^{1/2} S} ])$, where $S \in \R^{n \times p}$ represents fixed-partition sampling.

Loses a $\sqrt{n/p}$ factor over the "ideal" Nesterov rate.

$$ \E\norm{y_k - x_*}_A \lesssim \left(1 - \sqrt{\frac{p}{n} \mu_{\mathrm{part}}} \right)^{k/2} D_0 \:. $$

**Question:** does the guarantee hold
for *other* sampling schemes, with
$\mu_{\mathrm{part}}$ replaced by $\mu = \lambda_{\min}(\E[ P_{A^{1/2} S} ])$?

(Main result). For all $k \geq 0$, accelerated block GS with any (non-degenerate) sampling scheme satisfies $$ \E\norm{y_k - x_*}_A \lesssim (1-\tau)^{k/2} D_0 \:. $$

Here, $\tau := \sqrt{ \frac{1}{\nu} \cdot \mu}$, where $\mu$ is as before, and $\nu$ is a new quantity which behaves roughly like $n/p$.

We also prove the **rate is sharp**:
there exists a starting point which matches the rate up to constants.

For fixed-partition sampling, we can prove that $\nu = n/p$, recovering Nesterov and Stich's earlier result. Combined with the sharpness of the rate, this proves that the $\sqrt{n/p}$ loss over the ideal Nesterov rate is real.

For random coordinate sampling, we can only prove the weaker claim $$ \nu \leq \frac{n}{p} \max_{\abs{J} = p} \frac{\max_{i \in J} A_{ii}}{\lambda_{\min}(A_{JJ})} \:. $$

Empirically, $\nu$ never strays too far from $n/p$ under random coordinate sampling; tightening this bound is left as an open problem.

Block sampling distribution is very important for GS
(need to trade-off **systems** performance with **optimization**
performance).

We generalize the theory of block GS to the **accelerated case**.

**Poster session:** Mon 6:30pm-10:00pm.