# HW3

Hua Yao, UNI:hy2632

## Problem 1: SVM algorithm in action [50 points]

### Description of the algorithm:

1. Used dual SVM

2. Did not consider regularization, a.k.a. set $$C=\infty$$, because this problem (binary classification between 0/9) should be separable, and the representation of $$b$$ becomes nasty with regularization.

3. Used the SMO (sequential minimal optimization) algorithm. During optimization, within each iteration, randomly select $$\alpha_1, \alpha_2$$, optimize the QP w.r.t. $$\alpha_2$$ and update $$\alpha_1$$ accordingly. Added constraint $$\alpha_2 \geq 0$$ to the $$\alpha_2$$ optimizer. This does not constrain $$\alpha_1\geq 0$$ directly. However, with the randomization within each iteration, $$\alpha_i \geq 0$$ is satisfied when the whole optimzation over $$\alpha$$ finally converges.

4. Provide 2 options: Linear Kernel (baseline SVM) or Softmax kernel

$K_{SM}(x, y) = \exp{x^\top y}$

To avoid explosion on scale, normalized the input $$x$$.

Included the trigonometric feature map $$\phi(x)$$ of the softmax kernel for calculating $$b$$, (not for $$w$$ because when making prediction, we use kernel function instead of $$w^\top \phi$$).

Use exact kernel instead of approximating kernel with random feature map, because softmax's dimensionality is infinite. Directly compute the exponential in numpy is more efficient.

The prediction comes like this (vectorized version):

$y_{new} = K(X_{new}, X)\cdot (\alpha * y) + b$

Note that b is broadcasted to $$n'$$ new data points. $$K(X_{new}, X)$$ is $$n' \times n$$ kernel matrix. $$\alpha*y$$ means the elementwise product of two vectors.

5. SVM is expensive when $$n$$ is large. Here in practice, we trained on a small batch (default=64). The randomness here influences the performance.

6. Have a few trial runs to get the model with best prediction on the training data. Then it should give good prediction on the validation data. You might also need to tune the hyperparameters a little bit, like batch_size and tol.

### Training Example: Jupyter notebook file

#### "0"/"9" Data Extraction from MNIST

tensor(0)

#### Load Train & Val data and preprocess

HBox(children=(FloatProgress(value=0.0, max=200000.0), HTML(value='')))

>> Optimizing, step 1. Δalpha:6.0
>> Optimized on the batch, step 4446. 5 steps Δalpha:1.7763568394002505e-15

(0.9761623989218329, 0.9773755656108597)
HBox(children=(FloatProgress(value=0.0, max=200000.0), HTML(value='')))

>> Optimizing, step 1. Δalpha:15.273472888060688
>> Optimized on the batch, step 251. 5 steps Δalpha:0.0

(0.9834905660377359, 0.9803921568627451)

### Result:

Train Accuracy / Validation Accuracy:

• Linear SVM: (0.976, 0.977)
• Softmax Kernel SVM: (0.983, 0.980)

Note that the result varies over different trials because of the randomness in batch selection and ending condition.

## Problem 2: Kernel Ridge Regression [20 points]

### General Ridge Regression

$\hat{y} = X\theta$

$J(\theta) = \frac12(y-X\theta)^T(y-X\theta) + \frac{\lambda}2\|\theta\|^2$

Let

\begin{aligned} & \nabla_{\theta}J(\theta) = 0 \\ \to \quad & \theta = (X^TX + \lambda I_d)^{-1}X^Ty \end{aligned}

### Method 1: Approximate kernel with random features

When using feature map, substituting $$X$$ with $$\phi = \phi(X)$$ gives us

\begin{aligned} \theta &= (\phi^T\phi + \lambda I_m)^{-1}\phi^Ty \end{aligned}

When predicting,

\begin{aligned} y_{new} &= \phi(X_{new}) (\phi^T\phi + \lambda I_m)^{-1}\phi^Ty \end{aligned}

The time complexity of computing $$\theta$$ is $$O(\max{(mn^2,m^3,m^2n)})$$, considering the matrix inverse and the matrix multiplications.

### Method 2: Exact kernel

If we use matrix identity trick, we can change the dimensionality of computation.

\begin{aligned} \theta &= \phi^T(\phi\phi^T + \lambda I_n)^{-1}y \\ &= \phi^T\big(K + \lambda I_n\big)^{-1}y \end{aligned}

$$K = K(X,X)$$ is $$n\times n$$ kernel matrix of the training data. This gives the closed form solution of the weight.

When predicting, \begin{aligned} y_{new} &= \phi(X_{new})\phi(X)^T\big(K + \lambda I_n\big)^{-1}y \\ &= K(X_{new}, X) \big(K + \lambda I_n\big)^{-1}y \\ &= \kappa(X_{new}) \big(K + \lambda I_n\big)^{-1}y \end{aligned}

The time complexity of computing $$\theta$$ is $$O(\max{(n^2d, n^3, mn^2)})$$. Directly computing the kernel matrix takes $$O(n^2d)$$, matrix inverse $$O(n^3)$$, matrix multiplication $$O(mn^2), O(mn)$$

### Comparison: Time complexity of training

When $$m \gg n$$, approximate kernel takes

$O(\max{(mn^2,m^3,m^2n)}) = O(m^3)$

exact kernel method takes

$O(\max{(n^2d, n^3, mn^2)})= O(mn^2)$

$\because m^3 \gg mn^2$

$\therefore \text{time complexity of approximate kernel method is higher than the exact kernel method}$