## Recap

KKT, Lagrangian

$L(w, b, \alpha, \beta) = f(w,b) + \sum_{i=1}^{N}{\alpha_ig_i(w,b)} + \sum_{i=1}^{N}{\beta_ih_i(w,b)}$

$\theta(w,b) = \max_{\alpha \geq 0,\, \beta} L(w, b, \alpha, \beta)$

$\theta(w,b) = \begin{cases} f(w,b) & \text{if feasible} \\ \infty & \text{otherwise} \end{cases}$

(If some $g_i(w,b) > 0$ or $h_i(w,b) \neq 0$, the max over $\alpha_i \geq 0, \beta_i$ blows up to $\infty$; on feasible points the best choice is $\alpha_i g_i = 0$, leaving $f(w,b)$.)

$\min_{w,b}\theta(w,b) = \min_{w,b}\max_{\alpha \geq 0,\, \beta} L(w, b, \alpha, \beta)$

If $$f$$ and $$g_i$$ are convex and $$h_i$$ are affine, strong duality holds:

$\min_{w,b}\max_{\alpha \geq 0,\, \beta} L(w, b, \alpha, \beta) = \max_{\alpha \geq 0,\, \beta}\min_{w,b} L(w, b, \alpha, \beta)$

At the optimum, the KKT conditions hold: the constraints are satisfied, the partial derivatives of $$L$$ with respect to $$w$$, $$b$$, and $$\beta_i$$ all vanish, and complementary slackness $$\alpha_ig_i = 0$$ holds.
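Spelled out for this Lagrangian, at a primal-dual optimal point:

$\begin{aligned} &\nabla_{w}L = 0, \quad \nabla_{b}L = 0 && \text{(stationarity)} \\ &g_i(w,b) \leq 0, \quad h_i(w,b) = 0 && \text{(primal feasibility)} \\ &\alpha_i \geq 0 && \text{(dual feasibility)} \\ &\alpha_i\, g_i(w,b) = 0 && \text{(complementary slackness)} \end{aligned}$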

## SVM-primal

$\min_{w,b} \frac12 \|w\|^2 \\ \text{s.t.} \quad l^i(w^Tx_i+b) \geq 1 \quad \text{for } i=1, \dots, N$

$g_i(w,b) = 1-l^i(w^Tx_i+b) \leq 0\\ \to L(w, b, \alpha) = \frac12 \|w\|^2 - \sum_{i=1}^N{\alpha_i \big(l^i(w^Tx_i+b)-1\big)}$

$\nabla_wL(w, b, \alpha) = w - \sum_{i=1}^{N}\alpha_il^ix_i=0 \\ \to w = \sum_{i=1}^{N}\alpha_il^ix_i$

$\nabla_bL(w, b, \alpha) = -\sum_{i=1}^{N}\alpha_il^i = 0$

Substituting back into the Lagrangian (the $b$ term drops since $\sum_i\alpha_il^i = 0$),

\begin{aligned} L(w,b,\alpha) &= \frac12 \Big\|\sum_{i=1}^{N}\alpha_il^ix_i\Big\|^2 - \sum_{i=1}^N{\alpha_i \big(l^i(w^Tx_i+b)-1\big)} \\ &= \frac12 \Big\|\sum_{i=1}^{N}\alpha_il^ix_i\Big\|^2 - \Big\|\sum_{i=1}^{N}\alpha_il^ix_i\Big\|^2 - b\sum_{i=1}^N{\alpha_il^i} + \sum_{i=1}^N{\alpha_i} \\ &= -\frac12 \Big\|\sum_{i=1}^{N}\alpha_il^ix_i\Big\|^2 + \sum_{i=1}^N{\alpha_i} \end{aligned}

The dual problem is then

$\max_{\alpha} -\frac12 \Big\|\sum_{i=1}^{N}\alpha_il^ix_i\Big\|^2 + \sum_{i=1}^N{\alpha_i} \\ \text{s.t.} \quad \alpha_i \geq 0, \quad \sum_{i=1}^N{\alpha_il^i}=0$
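As an illustration (not from the lecture), a minimal sketch that solves this dual with an off-the-shelf constrained optimizer (scipy's SLSQP); the toy data, the support-vector threshold, and the recovery of $b$ are all assumptions:

```python
# Sketch: solve the SVM dual with scipy's SLSQP on hypothetical toy data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])  # N x d
l = np.array([-1.0] * 10 + [1.0] * 10)                                  # labels in {-1, +1}

Q = (l[:, None] * X) @ (l[:, None] * X).T         # Q_ij = l^i l^j x_i^T x_j

def neg_dual(a):                                  # negate to turn max into min
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(l)), method="SLSQP",
               bounds=[(0, None)] * len(l),                          # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ l})   # sum_i alpha_i l^i = 0
alpha = res.x
w = (alpha * l) @ X                               # w = sum_i alpha_i l^i x_i
sv = alpha > 1e-6                                 # support vectors have alpha_i > 0
b = np.mean(l[sv] - X[sv] @ w)                    # from l^i (w^T x_i + b) = 1 on SVs
```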

Without loss of generality, let us assume that in the $$i^{th}$$ iteration of the optimization procedure for the DUAL-SVM, we optimize over: $$\alpha_1, \alpha_2$$

$\alpha_1l^1 + \alpha_2l^2 = -\sum_{i=3}^N{\alpha_il^i}$

$\alpha_2 = \frac{-\sum_{i=3}^N{\alpha_il^i} - \alpha_1l^1 }{l^2}$
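A minimal sketch of this two-variable update (the analytic step used by SMO-style solvers); the kernel matrix `K`, the chosen pair `i, j`, and the hard-margin clipping at $\alpha \geq 0$ are assumptions for illustration:

```python
# One SMO-style pair update for the hard-margin dual; K[m, k] = x_m^T x_k here.
import numpy as np

def smo_pair_step(alpha, l, K, i, j):
    """Optimize alpha_i, alpha_j with all other multipliers frozen."""
    s = l[i] * alpha[i] + l[j] * alpha[j]        # fixed by the equality constraint
    f = (alpha * l) @ K                          # f[k] = sum_m alpha_m l^m K[m, k]
    E_i, E_j = f[i] - l[i], f[j] - l[j]          # prediction errors
    eta = K[i, i] + K[j, j] - 2 * K[i, j]        # curvature along the constraint line
    if eta <= 0:
        return alpha                             # degenerate pair: skip
    new = alpha.copy()
    a_j = max(alpha[j] + l[j] * (E_i - E_j) / eta, 0.0)  # 1-D optimum, clipped at 0
    a_i = l[i] * (s - l[j] * a_j)                # restore l^i a_i + l^j a_j = s
    if a_i < 0:                                  # keep alpha_i >= 0 as well
        a_i, a_j = 0.0, max(l[j] * s, 0.0)
    new[i], new[j] = a_i, a_j
    return new
```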

Each step: freeze all other multipliers and optimize over the chosen pair. (Heuristic; this is the idea behind SMO.)

## Kernel Ridge Regression

Data $$\{x_i\}$$ with targets $$\{y_i\}$$.

Cost function (linear) $C(w) = \frac12 \sum_{i=1}^N{(y_i - w^Tx_i)^2}$

$H: w^Tx + b = 0$

$$x_i' = [x_i;\, 1]$$: concatenate a constant $$1$$ to each $$x_i$$ so the bias $$b$$ is absorbed into $$w$$.

Standard linear regression simply minimizes $$C(w)$$.

In ridge regression,

$\min_w C(w) = \frac12 \sum_{i=1}^N{(y_i - w^Tx_i)^2} + \frac12 \lambda \|w\|_2^2$

Setting the gradient to zero, $\nabla_{w}C(w^*) = 0$:

$w^* = \big(\lambda I + \sum_{i=1}^N{x_ix_i^T}\big) ^{-1}\big(\sum_{j=1}^N{y_jx_j}\big)$

• Time complexity:
• each $$x_ix_i^T$$ is a matrix in $$\mathbb{R}^{d\times d}$$
• forming $$\sum_{i=1}^N x_ix_i^T$$ takes $$O(Nd^2)$$
• inverting $$\big(\lambda I + \sum_{i=1}^N{x_ix_i^T}\big)$$ takes $$O(d^3)$$, for $$O(Nd^2 + d^3)$$ total (see the sketch below)
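A minimal sketch of the closed form in code; the toy sizes $N$, $d$ and the choice of $\lambda$ are assumptions:

```python
# Closed-form ridge regression on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 200, 5, 0.1
X = rng.normal(size=(N, d))                      # rows are x_i^T
y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=N)

# Forming X^T X = sum_i x_i x_i^T costs O(N d^2); the solve costs O(d^3).
w_star = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```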

Kernel method (possibly with an approximate feature map): $x_i \to \phi(x_i)$

$w = \big(\lambda I + \sum_{i=1}^N{\phi(x_i)\phi(x_i)^T}\big) ^{-1}\big(\sum_{j=1}^N{y_j\phi(x_j)}\big)$

## Matrix Identity

For invertible $$P \in \mathbb{R}^{p\times p}$$ and $$R \in \mathbb{R}^{r\times r}$$, and any $$B \in \mathbb{R}^{r\times p}$$:

$(P^{-1} + B^TR^{-1}B)^{-1}B^TR^{-1} = PB^T(BPB^T+R)^{-1}$

To verify, multiply both sides on the left by $$(P^{-1} + B^TR^{-1}B)$$:

$B^TR^{-1} = (P^{-1} + B^TR^{-1}B)PB^T(BPB^T+R)^{-1}$

then on the right by $$(BPB^T+R)$$:

$B^TR^{-1}(BPB^T+R) = (P^{-1}+B^TR^{-1}B)PB^T$

Both sides expand to $$B^TR^{-1}BPB^T + B^T$$, so the identity holds.
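A quick numerical spot-check of the identity; the random SPD choices for $P$ and $R$ are assumptions made just for the check:

```python
# Verify the push-through identity on random matrices.
import numpy as np

rng = np.random.default_rng(1)
r, p = 4, 6
B = rng.normal(size=(r, p))                      # B in R^{r x p}
A1, A2 = rng.normal(size=(p, p)), rng.normal(size=(r, r))
P = A1 @ A1.T + p * np.eye(p)                    # SPD, p x p
R = A2 @ A2.T + r * np.eye(r)                    # SPD, r x r

Pinv, Rinv = np.linalg.inv(P), np.linalg.inv(R)
lhs = np.linalg.inv(Pinv + B.T @ Rinv @ B) @ B.T @ Rinv
rhs = P @ B.T @ np.linalg.inv(B @ P @ B.T + R)
print(np.allclose(lhs, rhs))                     # True
```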

Applying the identity to ridge regression, with $$\phi(X)$$ the matrix whose columns are $$\phi(x_i)$$:

$w = \big(\lambda I +\phi(X)\phi(X)^T\big)^{-1}\phi(X)y = \phi(X)\big(\phi(X)^T\phi(X) + \lambda I\big)^{-1}y$

$\phi(X)^T\phi(X) \in \mathbb{R}^{N\times N}$ is the Gram (kernel) matrix: the $$N\times N$$ solve only needs inner products $$\phi(x_i)^T\phi(x_j)$$, never the explicit features.
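A sketch comparing the two forms on random features (the sizes are assumptions); when $N \ll d'$ the $N \times N$ solve is the cheaper one:

```python
# The two forms of w agree; columns of Phi play the role of phi(x_i).
import numpy as np

rng = np.random.default_rng(2)
dprime, N, lam = 50, 10, 0.5                     # feature dim d' >> N here
Phi = rng.normal(size=(dprime, N))               # columns phi(x_i)
y = rng.normal(size=N)

w_primal = np.linalg.solve(lam * np.eye(dprime) + Phi @ Phi.T, Phi @ y)  # d' x d' solve
w_dual = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), y)         # N x N solve
print(np.allclose(w_primal, w_dual))             # True
```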

## Predictions

For a new point $$x\in\mathbb{R}^d$$, predict $$y=w^T\phi(x)$$.

Number of random features: $m = \Theta(d\log d)$

$$G$$: unstructured Gaussian matrix $$\in \mathbb{R}^{d\times d}$$ (i.i.d. $$\mathcal{N}(0,1)$$ entries)

• To get a Gaussian orthogonal matrix (sketched in code below):
• Gram-Schmidt orthogonalization approach: orthogonalize the rows of $$G$$ to unit $$\ell_2$$ norm
• sample scalars $$s_1, \dots, s_d$$ independently from the $$\chi(d)$$ distribution and use them to rescale the rows (or simply scale by $$\sqrt{d}$$, but that is biased)
• structured alternative: $$\frac1{\sqrt{d}} HD_1 \cdots \frac1{\sqrt{d}} HD_L$$, with $$H$$ a Hadamard matrix and $$D_i$$ random diagonal $$\pm1$$ matrices
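A minimal sketch of the Gram-Schmidt construction from the list above, assuming QR as the orthogonalization step and $\chi(d)$ row rescaling:

```python
# Gaussian orthogonal matrix: orthonormalize, then restore Gaussian row norms.
import numpy as np

rng = np.random.default_rng(3)
d = 64
G = rng.normal(size=(d, d))                 # unstructured Gaussian matrix
Q, _ = np.linalg.qr(G)                      # Gram-Schmidt: orthonormal rows/columns
s = np.sqrt(rng.chisquare(df=d, size=d))    # chi(d) samples = norms of N(0, I_d) rows
G_ort = s[:, None] * Q                      # rescaled rows, still mutually orthogonal
# Biased but simpler alternative: G_ort = np.sqrt(d) * Q
```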