This post derives the dual program presented in the SDCA paper, in slightly more generality. The derivation is entirely straightforward and hence omitted from their paper, but it is a nice calculation to have handy. $ \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\R}{\mathbb{R}} \newcommand{\T}{\mathsf{T}} $

The setup is as follows. Given data $\{x_i\}_{i=1}^{n}$ with $x_i \in \R^{d}$, a regularization parameter $\lambda > 0$, scalar convex loss functions $\phi_i(\cdot)$, $i=1,...,n$, and a $d \times d$ symmetric positive definite matrix $H \succ 0$, the standard regularized risk minimization primal problem is defined as $$ \begin{align*} \min_{w \in \R^d} P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^\T x_i) + \frac{\lambda}{2} \norm{w}^2_H \:, \end{align*} $$ where $\norm{w}^2_H := w^\T H w$.
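
For concreteness, here is a minimal NumPy sketch of the primal objective (the function name and interface are my own, not from the paper); the losses are passed in as arbitrary scalar callables:

```python
import numpy as np

def primal_objective(w, X, losses, lam, H):
    """P(w) = (1/n) * sum_i phi_i(w^T x_i) + (lam/2) * w^T H w.

    X has rows x_i^T; losses is a list of the scalar convex functions phi_i.
    """
    n = X.shape[0]
    margins = X @ w                      # margins[i] = w^T x_i
    loss_term = sum(phi(m) for phi, m in zip(losses, margins)) / n
    reg_term = 0.5 * lam * w @ (H @ w)   # (lam/2) * ||w||_H^2
    return loss_term + reg_term
```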

Derivation

We now derive the dual of this program. Introduce variables $\gamma_i := w^\T x_i$, $i=1,...,n$. The primal program is thus equivalent to $$ \begin{align*} \min_{\gamma \in \R^n, w \in \R^d} \frac{1}{n} \sum_{i=1}^{n} \phi_i(\gamma_i) + \frac{\lambda}{2} \norm{w}^2_H \quad \text{s.t.} \quad \gamma_i = w^\T x_i, \; i=1,...,n \:. \end{align*} $$ Introducing multipliers $\alpha \in \R^n$ and invoking strong duality (which holds by Slater's condition, since the constraints are affine and the problem is feasible), we obtain $$ \begin{align*} P^* &:= \min_{\substack{\gamma \in \R^n, w \in \R^d \\ \gamma_i = w^\T x_i}} \frac{1}{n} \sum_{i=1}^{n} \phi_i(\gamma_i) + \frac{\lambda}{2} \norm{w}^2_H \\ &= \min_{\gamma \in \R^n, w \in \R^d} \max_{\alpha \in \R^n} \frac{1}{n} \sum_{i=1}^{n} \phi_i(\gamma_i) + \frac{\lambda}{2} \norm{w}^2_H + \frac{1}{n} \sum_{i=1}^{n} \alpha_i (\gamma_i - w^\T x_i) \\ &= \max_{\alpha \in \R^n} \min_{\gamma \in \R^n, w \in \R^d} \frac{1}{n} \sum_{i=1}^{n} \phi_i(\gamma_i) + \frac{\lambda}{2} \norm{w}^2_H + \frac{1}{n} \sum_{i=1}^{n} \alpha_i (\gamma_i - w^\T x_i) \\ &= \max_{\alpha \in \R^n} \min_{\gamma \in \R^n, w \in \R^d} \frac{1}{n} \sum_{i=1}^{n} (\phi_i(\gamma_i) + \alpha_i \gamma_i) + \frac{\lambda}{2} \norm{w}^2_H - \frac{1}{n} \sum_{i=1}^{n} \alpha_i w^\T x_i \:. \end{align*} $$

The first subproblem we encounter is $$ \begin{align*} \min_{w \in \R^d} \frac{\lambda}{2} \norm{w}^2_H - \frac{1}{n} \sum_{i=1}^{n} \alpha_i w^\T x_i \:. \end{align*} $$ Setting the gradient with respect to $w$ to zero, the solution satisfies $Hw - \frac{1}{n\lambda} \sum_{i=1}^{n} \alpha_i x_i = 0$, yielding $w(\alpha) := \frac{1}{n\lambda} \sum_{i=1}^{n} \alpha_i H^{-1} x_i$. Plugging $w(\alpha)$ back in and using $\frac{1}{n} \sum_{i=1}^{n} \alpha_i x_i = \lambda H w(\alpha)$, we get $$ \begin{align*} \frac{\lambda}{2} \norm{w(\alpha)}^2_H - \frac{1}{n} \sum_{i=1}^{n} \alpha_i w(\alpha)^\T x_i = -\frac{\lambda}{2} \norm{w(\alpha)}^2_H = -\frac{\lambda}{2} \norm{ \frac{1}{n\lambda} \sum_{i=1}^{n} \alpha_i x_i }^2_{H^{-1}} \:. \end{align*} $$

Hence, continuing from above, $$ \begin{align*} P^* &= \max_{\alpha \in \R^n} \min_{\gamma \in \R^n} \frac{1}{n} \sum_{i=1}^{n} (\phi_i(\gamma_i) + \alpha_i \gamma_i) - \frac{\lambda}{2} \norm{ \frac{1}{n\lambda} \sum_{i=1}^{n} \alpha_i x_i }^2_{H^{-1}} \\ &= \max_{\alpha \in \R^n} \frac{1}{n} \sum_{i=1}^{n} \min_{\gamma_i \in \R} (\phi_i(\gamma_i) + \alpha_i \gamma_i) - \frac{\lambda}{2} \norm{ \frac{1}{n\lambda} \sum_{i=1}^{n} \alpha_i x_i }^2_{H^{-1}} \\ &= \max_{\alpha \in \R^n} \frac{1}{n} \sum_{i=1}^{n} -\max_{\gamma_i \in \R} (\gamma_i(-\alpha_i) - \phi_i(\gamma_i)) - \frac{\lambda}{2} \norm{ \frac{1}{n\lambda} \sum_{i=1}^{n} \alpha_i x_i }^2_{H^{-1}} \\ &= \max_{\alpha \in \R^n} D(\alpha) := \frac{1}{n} \sum_{i=1}^{n} -\phi^\star_i(-\alpha_i) - \frac{\lambda}{2} \norm{ \frac{1}{n\lambda} \sum_{i=1}^{n} \alpha_i x_i }^2_{H^{-1}} \:. \end{align*} $$ Above, $\phi^\star_i(y) := \sup_{x \in \R} (xy - \phi_i(x))$ is the Fenchel conjugate of $\phi_i(\cdot)$. This expression agrees with Equation 2 in the SDCA paper when we set $H = I_d$. In this case, we have that $$ \begin{equation} D(\alpha) = \frac{1}{n} \sum_{i=1}^{n} -\phi^\star_i(-\alpha_i) - \frac{1}{2\lambda n^2} \alpha^\T XX^\T \alpha \:, \label{eq:dual} \end{equation} $$ where $X \in \R^{n \times d}$ is the usual data matrix whose $i$-th row is $x_i^\T$.
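
To complement the primal sketch above, here is a matching NumPy evaluator for $D(\alpha)$ and the dual-to-primal map $w(\alpha)$ (again, the function names and interface are my own, not from the paper):

```python
import numpy as np

def dual_objective(alpha, X, conjugates, lam, H):
    """D(alpha) = (1/n) * sum_i -phi_i^*(-alpha_i) - (lam/2) * ||v||_{H^{-1}}^2,
    where v = (1/(n*lam)) * sum_i alpha_i x_i = (1/(n*lam)) * X^T alpha.

    conjugates is a list of the Fenchel conjugates phi_i^*; X has rows x_i^T.
    """
    n = X.shape[0]
    conj_term = -sum(phi_star(-a) for phi_star, a in zip(conjugates, alpha)) / n
    v = X.T @ alpha / (n * lam)
    reg_term = 0.5 * lam * v @ np.linalg.solve(H, v)   # (lam/2) * v^T H^{-1} v
    return conj_term - reg_term

def w_of_alpha(alpha, X, lam, H):
    """Dual-to-primal map w(alpha) = (1/(n*lam)) * H^{-1} X^T alpha."""
    n = X.shape[0]
    return np.linalg.solve(H, X.T @ alpha) / (n * lam)
```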

Least squares

Let $\phi_i(a) = \frac{1}{2} (a - y_i)^2$ for $y_i \in \R$. Also let $Y \in \R^n$ denote the vector of labels. The primal problem then reduces to the familiar least squares program $$ \begin{equation} \min_{w \in \R^d} \frac{1}{2n} \norm{X w - Y}^2 + \frac{\lambda}{2} \norm{w}^2 \:. \label{eq:primal_ls} \end{equation} $$ A simple calculation gives us that $\phi_i^\star(u) = \frac{1}{2} u^2 + u y_i$, and hence plugging this into the formula for the dual program, we recover $$ \begin{equation} \max_{\alpha \in \R^n} \frac{1}{n} Y^\T \alpha - \frac{1}{2n} \norm{\alpha}^2 - \frac{1}{2\lambda n^2} \alpha^\T XX^\T \alpha \:. \label{eq:dual_ls} \end{equation} $$ I actually think the least squares problem gives one of the clearest illustrations of the difference between the primal and dual problems for risk minimization. Indeed, solving $\eqref{eq:primal_ls}$ amounts to solving a $d \times d$ linear system, whereas solving $\eqref{eq:dual_ls}$ amounts to solving an $n \times n$ linear system. When $d < n$, you probably want $\eqref{eq:primal_ls}$, and when $d \geq n$, you probably want $\eqref{eq:dual_ls}$. See this paper by Hefny, Needell and Ramdas for more discussion on this topic. One clear advantage of the dual, however, is discussed next.
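
As a quick sanity check of this correspondence, here is a minimal NumPy sketch (with made-up data, not from the paper) that solves both linear systems and verifies that the dual solution maps back to the primal one via $w(\alpha) = \frac{1}{\lambda n} X^\T \alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 10, 0.1
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# Primal optimality condition: (X^T X / n + lam * I) w = X^T Y / n, a d x d system.
w_primal = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)

# Dual optimality condition (gradient of the dual objective set to zero):
# (I + X X^T / (lam * n)) alpha = Y, an n x n system; then w(alpha) = X^T alpha / (lam * n).
alpha = np.linalg.solve(np.eye(n) + X @ X.T / (lam * n), Y)
w_dual = X.T @ alpha / (lam * n)

assert np.allclose(w_primal, w_dual)   # both routes recover the same minimizer
```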

Kernels in the dual

Equation $\eqref{eq:dual}$ depends on the data only through the Gram matrix $XX^\T$. Hence, the usual trick of replacing $XX^\T$ with a kernel matrix applies.
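
For instance, with the least squares loss from the previous section, swapping $XX^\T$ for an RBF kernel matrix gives kernel ridge regression. A rough NumPy sketch (the kernel choice, bandwidth, and helper names are my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 2, 0.1
X = rng.standard_normal((n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Replace the Gram matrix X X^T in the least squares dual with a kernel matrix K
# and solve the same n x n system for alpha.
K = rbf_kernel(X, X)
alpha = np.linalg.solve(np.eye(n) + K / (lam * n), Y)

# In feature space, w = (1/(lam * n)) * sum_i alpha_i phi(x_i), so predictions only
# require kernel evaluations: f(x) = (1/(lam * n)) * sum_i alpha_i k(x_i, x).
X_test = rng.standard_normal((5, d))
preds = rbf_kernel(X_test, X) @ alpha / (lam * n)
```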

Usually, the so-called "kernel trick" is introduced alongside the dual of the SVM. The drawback of this approach is that it might lead some to erroneously believe that kernels only work for SVMs, which is not the case. I also think that in chasing after the SVM dual, it is easy to get lost in the annoying details of the hinge loss and lose sight of the bigger picture.

I like this approach better because it abstracts away details of the particular loss function, and makes it very clear that it is really the linear functional structure of the primal problem that makes the use of a kernel possible. Indeed, one can completely recover the SVM dual from $\eqref{eq:dual}$ by using the hinge loss $\phi_i(a) = \max\{0, 1-y_i a\}$; the conjugate of the hinge neatly packages all the annoying details of the loss during the dual derivation.
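
For concreteness, here is a sketch of that calculation (it is standard, but worth writing out). For labels $y_i \in \{\pm 1\}$, the Fenchel conjugate of the hinge loss is $$ \begin{align*} \phi_i^\star(u) = \begin{cases} u y_i & \text{if } u y_i \in [-1, 0] \:, \\ +\infty & \text{otherwise} \:, \end{cases} \end{align*} $$ so $-\phi_i^\star(-\alpha_i) = \alpha_i y_i$ whenever $\alpha_i y_i \in [0, 1]$ and $-\infty$ otherwise. Plugging this into $\eqref{eq:dual}$ turns the implicit domain restriction into an explicit constraint, giving $$ \begin{align*} \max_{\substack{\alpha \in \R^n \\ \alpha_i y_i \in [0,1], \; i=1,...,n}} \frac{1}{n} \sum_{i=1}^{n} \alpha_i y_i - \frac{1}{2\lambda n^2} \alpha^\T XX^\T \alpha \:, \end{align*} $$ which, up to a rescaling of the dual variables, is the usual SVM dual.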