Data-Mining-HW3

HW3

Hua Yao, UNI:hy2632

Problem 1: SVM algorithm in action [50 points]

Description of the algorithm:

  1. Used dual SVM

  2. Did not use regularization, i.e., set \(C=\infty\), because this problem (binary classification between 0 and 9) should be separable, and the closed-form expression for \(b\) becomes messy under regularization.

  3. Used the SMO (sequential minimal optimization) algorithm. In each iteration, randomly select a pair \(\alpha_1, \alpha_2\), optimize the dual QP w.r.t. \(\alpha_2\), and update \(\alpha_1\) accordingly (a short derivation of the \(\alpha_1\) update is sketched right after this list). The constraint \(\alpha_2 \geq 0\) is imposed on the \(\alpha_2\) optimizer; this does not constrain \(\alpha_1 \geq 0\) directly, but with the random pair selection in each iteration, \(\alpha_i \geq 0\) holds for all \(i\) when the optimization over \(\alpha\) finally converges.

  4. Provided two options: a linear kernel (the baseline SVM) or the softmax kernel

    \[K_{SM}(x, y) = \exp(x^\top y)\]

    To avoid the kernel values exploding in scale, the input \(x\) is normalized to unit norm.

    Included the trigonometric feature map \(\phi(x)\) of the softmax kernel for computing \(b\) (not for making predictions with \(w\): at prediction time the kernel function is used instead of \(w^\top \phi\)).

    Used the exact kernel instead of approximating it with a random feature map, because the softmax kernel's feature space is infinite-dimensional and computing the exponential directly in NumPy is more efficient.

    The prediction is computed as follows (vectorized version):

    \[y_{new} = K(X_{new}, X)\cdot (\alpha * y) + b\]

    Note that \(b\) is broadcast over the \(n'\) new data points, \(K(X_{new}, X)\) is the \(n' \times n\) kernel matrix, and \(\alpha * y\) denotes the elementwise product of the two vectors.

  5. SVM training is expensive when \(n\) is large, so in practice we train on a small random batch (default size 64). The randomness of the batch selection affects the performance.

  6. Run a few trials and keep the model with the best accuracy on the training data; it should then also predict well on the validation data. The hyperparameters, such as batch_size and tol, may need a little tuning.

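A short sketch of the \(\alpha_1\) bookkeeping used in SMO (item 3 above; this is exactly what `update_alpha` in SVM.py does). Since \(y_i \in \{-1, +1\}\) and hence \(y_1^2 = 1\), the dual equality constraint forces

\[\sum_{i=1}^{n}\alpha_i y_i = 0 \quad\Rightarrow\quad \alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i\neq 1,2}\alpha_i y_i =: \zeta \quad\Rightarrow\quad \alpha_1 = (\zeta - \alpha_2 y_2)\,y_1\]

So after the one-dimensional QP over \(\alpha_2\) (with the bound \(\alpha_2 \geq 0\)), \(\alpha_1\) is recomputed from \(\zeta\) and the equality constraint keeps holding.
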
Classes: Kernel and SVM

Kernel_FeatureMaps.py

import numpy as np


def gram_schmidt_columns(X):
    Q, R = np.linalg.qr(X)
    return Q


def orthgonalize(V):
    N = V.shape[0]
    d = V.shape[1]
    turns = int(N / d)
    remainder = N % d

    V_ = np.zeros_like(V)

    for i in range(turns):
        v = gram_schmidt_columns(V[i * d:(i + 1) * d, :].T).T
        V_[i * d:(i + 1) * d, :] = v
    if remainder != 0:
        V_[turns * d:, :] = gram_schmidt_columns(V[turns * d:, :].T).T

    return V_


# Generate orthogonal normal weights (w1, ..., wm)
def generateGMatrix(m, d) -> np.ndarray:
    G = np.random.normal(0, 1, (m, d))
    # Renormalize: keep the row norms of the original Gaussian matrix
    norms = np.linalg.norm(G, axis=1).reshape([m, 1])
    return orthgonalize(G) * norms


# Softmax trigonometric feature map Φ(x)
def baseline_SM(x, G):
    """Calculate the softmax trigonometric random feature mapping.
    Parameters
    ----------
    x: array, dimension = n*d
        Input to the baseline mapping.
        Each row x_i is required to have norm 1, i = 1, ..., n
    G: matrix, dimension = m*d
        The matrix in the baseline random feature mapping
    """
    m = G.shape[0]
    left = np.cos(np.dot(x, G.T).astype(np.float32))
    right = np.sin(np.dot(x, G.T).astype(np.float32))
    return np.exp(0.5) * ((1 / m)**0.5) * np.hstack([left, right])


class Kernel(object):
    """Kernels.
    Parameters
    ---------
    x, y: arrays of shape (n_x, d), (n_y, d)
    Return
    ---------
    K(x, y): kernel matrix of shape (n_x, n_y), K_ij = K(x_i, y_j)
    """
    @staticmethod
    def Linear(x, y):
        return np.dot(x, y.T)

    @staticmethod
    def Softmax(x, y):
        return np.exp(np.dot(x, y.T))


class FeatureMap(object):
    """Feature mappings.

    Parameters
    ---------
    x: array of shape (n, d)
    Return
    ---------
    Φ(x): array of shape (n, m), where m is usually a higher dimensionality.
        Here m = 4*d: G has 2*d rows, and the cos/sin stacking doubles it again.
    """
    @staticmethod
    def Linear(x):
        return x

    @staticmethod
    def Softmax_Trigonometric(x):
        n, d = x.shape
        # G has 2*d rows, so Φ(x) has dimensionality 4*d
        G = generateGMatrix(2 * d, d)
        phi_x = baseline_SM(x, G)
        return phi_x

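A quick sanity check of the trigonometric feature map above (a minimal sketch, not part of the homework pipeline): on unit-norm inputs, \(\Phi(X)\Phi(X)^\top\) should approximate the exact softmax kernel \(\exp(XX^\top)\), with the error shrinking as the number of random features grows.

import numpy as np
from Kernel_FeatureMaps import Kernel, baseline_SM, generateGMatrix

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
x = x / np.linalg.norm(x, axis=1, keepdims=True)   # unit-norm rows, as baseline_SM assumes

K_exact = Kernel.Softmax(x, x)                     # exact kernel: exp(x_i^T x_j)
G = generateGMatrix(512, 16)                       # 512 block-orthogonal Gaussian rows
phi = baseline_SM(x, G)                            # (5, 1024) after cos/sin stacking
K_approx = phi @ phi.T

print(np.max(np.abs(K_exact - K_approx)))          # small, and shrinks as the number of rows of G grows
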
SVM.py

import numpy as np
from scipy.optimize import minimize
from tqdm.notebook import tqdm
from Kernel_FeatureMaps import *


class SVM(object):
    """
    ## Author:
        Hua Yao (hy2632@columbia.edu)
    ## Description:
        SVM binary classifier, optimizing the dual Lagrangian program on a sample batch with the SMO algorithm.
        Normalizes the input X to adapt to the kernelized version.
        Regularization parameter C is set to `np.inf` for an easy closed-form solution of b.
    ## Reference:
        [CS229 - Kernel Methods and SVM](http://cs229.stanford.edu/notes2020fall/notes2020fall/cs229-notes3.pdf)
    ---
    Parameters:
    ---------
    X: (N, d)
    Y: (N,)
    kernel: kernel used, linear/softmax
    featuremap: feature mapping corresponding to the kernel used
    batch_size: int, also denoted as n
    C: l1 regularization term for the soft margin. alpha_i in [0, C]. Set to np.inf (no regularization), because the form of b is messy under regularization.
    tol = 1e-6: tolerance, deciding when to end training
    Intermediate parameters:
    ---------
    x: (n, d), random batch of X
    y: (n,)
    phi_x: (n, m), feature map of x
    M: (n, n), M[i,j] = y_i y_j * K(x_i, x_j), elementwise (Hadamard) product of y y^T and K
    Learned parameters:
    ---------
    alpha: (n,)
    w: (m,), weight in feature space (m = d for the linear feature map)
    b: float
    """
    def __init__(self,
                 X,
                 Y,
                 kernel=Kernel.Linear,
                 featuremap=FeatureMap.Linear,
                 batch_size=64,
                 C=np.inf,
                 tol=1e-6):
        # C set as np.inf here -- no regularization

        # X: (N,d), Y: (N,), x: (n,d), y: (n,)
        # Fixed values
        self.N, self.d = X.shape
        # Normalize data
        self.X = X
        self.X = self.X / np.linalg.norm(self.X, axis=1, keepdims=True)
        self.Y = Y
        self.kernel = kernel
        self.featuremap = featuremap
        self.n = batch_size
        self.C = C
        self.tol = tol

        batch_indices = np.random.choice(np.arange(self.N), self.n)
        self.x = self.X[batch_indices]
        self.y = self.Y[batch_indices]
        self.phi_x = self.featuremap(self.x)
        self.M = np.outer(self.y, self.y) * self.kernel(self.x, self.x)

        # Learned parameters
        self.alpha = np.ones(self.n)
        self.w = np.zeros(self.d)
        self.b = 0

    def update_alpha(self, random_idx1, random_idx2):
        # Maintain the equality constraint sum_i alpha_i * y_i = 0 by solving for alpha_1
        Zeta = -np.sum(self.alpha * self.y) + (
            self.alpha * self.y)[random_idx1] + (self.alpha *
                                                 self.y)[random_idx2]
        self.alpha[random_idx1] = (Zeta - self.alpha[random_idx2] *
                                   self.y[random_idx2]) * self.y[random_idx1]

    def dual_obj(self, alpha):
        return np.sum(alpha) - np.sum(0.5 * self.M * np.outer(alpha, alpha))

    def fit(self, iterations=200000):
        prev_val = self.alpha.copy()

        for i in tqdm(range(iterations)):
            # Select 2 distinct alphas randomly
            random_idx1, random_idx2 = np.random.choice(
                np.arange(0, self.n), 2, replace=False)

            # The (quadratic w.r.t. a2) function that scipy.optimize.minimize takes
            def optimizeWRTa2(a2):
                self.alpha[random_idx2] = a2
                self.update_alpha(random_idx1, random_idx2)
                return -self.dual_obj(self.alpha)

            # Solve the optimization w.r.t. a2
            a2 = self.alpha[random_idx2]
            res = minimize(optimizeWRTa2, a2, bounds=[(0, self.C)])
            a2 = res.x[0]  # res.x is a length-1 array

            # Update a2 (and a1 through the equality constraint)
            self.alpha[random_idx2] = a2
            self.update_alpha(random_idx1, random_idx2)

            # Check convergence every 5 steps
            if (i % 5 == 1):
                if np.sum(np.abs(self.alpha - prev_val)) < self.tol:
                    print(
                        f">> Optimized on the batch, step {i}. 5 steps Δalpha:{np.sum(np.abs(self.alpha - prev_val))}"
                    )
                    break  # break (not return) so that w and b below are still computed
                else:
                    if (i % 5000 == 1):
                        print(
                            f">> Optimizing, step {i}. Δalpha:{np.sum(np.abs(self.alpha - prev_val))}"
                        )
                prev_val = self.alpha.copy()

        # Retrieve w and b
        self.w = np.dot(self.alpha * self.y, self.phi_x)
        # The form of b changes when there is a regularization parameter C, so we simply drop C here.
        # See p.25 of CS229 - Note 3.
        # The form of b under regularization depends on alpha (http://cs229.stanford.edu/materials/smo.pdf).
        # [Bias Term b in SVMs Again](https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2004-11.pdf) gives a general representation.
        self.b = (np.max(np.dot(self.phi_x, self.w)[self.y == -1]) +
                  np.min(np.dot(self.phi_x, self.w)[self.y == 1])) * -0.5

    def predict(self, X_val):
        X_val_normed = X_val / np.linalg.norm(X_val, axis=1, keepdims=True)
        return np.sign(
            np.dot(self.kernel(X_val_normed, self.x), self.alpha * self.y) +
            self.b)

    def score(self, X_val, y_val):
        prediction = self.predict(X_val)
        return np.mean(prediction == y_val)

Training Example: Jupyter notebook file

import numpy as np
from scipy.optimize import minimize
from tqdm.notebook import tqdm
import torch
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
%matplotlib inline

from SVM import SVM
from Kernel_FeatureMaps import *

"0"/"9" Data Extraction from MNIST

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
trainset = datasets.MNIST('PATH_TO_STORE_TRAINSET', download=True, train=True, transform=transform)
valset = datasets.MNIST('PATH_TO_STORE_TESTSET', download=True, train=False, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset)
valloader = torch.utils.data.DataLoader(valset)
ZeroNineTrain = [i for i in tqdm(trainset) if (i[1] == 0)|(i[1] == 9)]
X_train, y_train = zip(*ZeroNineTrain)
X_train = torch.stack(X_train)
y_train = torch.tensor(y_train)
ZeroNineVal = [i for i in tqdm(valset) if (i[1] == 0)|(i[1] == 9)]
X_val, y_val = zip(*ZeroNineVal)
X_val = torch.stack(X_val)
y_val = torch.tensor(y_val)
plt.imshow(X_train[0].squeeze(), cmap="gray_r")
plt.show()
y_train[0]
[image output: X_train[0], a handwritten "0"]
tensor(0)
torch.save(X_train, "X_train.pt")
torch.save(y_train, "y_train.pt")
torch.save(X_val, "X_val.pt")
torch.save(y_val, "y_val.pt")

Load Train & Val data and preprocess

# Load Data
X_train = torch.load("X_train.pt")
y_train = torch.load("y_train.pt")
X_val = torch.load("X_val.pt")
y_val = torch.load("y_val.pt")


# Criteria: Label0: +1, Label9: -1
y_train = (y_train==0)*2-1
y_val = (y_val==0)*2-1
# Formatting
X_train = X_train.squeeze().view((X_train.shape[0],-1)).numpy()
y_train = y_train.numpy()
X_val = X_val.squeeze().view((X_val.shape[0],-1)).numpy()
y_val = y_val.numpy()
svm_linear = SVM(X_train, y_train, Kernel.Linear, FeatureMap.Linear, 64, np.inf, 1e-4)
svm_linear.fit()
svm_linear.score(X_train, y_train), svm_linear.score(X_val, y_val)


>> Optimizing, step 1. Δalpha:6.0
>> Optimized on the batch, step 4446. 5 steps Δalpha:1.7763568394002505e-15

(0.9761623989218329, 0.9773755656108597)
svm_SM = SVM(X_train, y_train, Kernel.Softmax, FeatureMap.Softmax_Trigonometric, 64, np.inf, 1e-4)
svm_SM.fit()
svm_SM.score(X_train, y_train), svm_SM.score(X_val, y_val)


>> Optimizing, step 1. Δalpha:15.273472888060688
>> Optimized on the batch, step 251. 5 steps Δalpha:0.0

(0.9834905660377359, 0.9803921568627451)

Result:

Train Accuracy / Validation Accuracy:

  • Linear SVM: (0.976, 0.977)
  • Softmax Kernel SVM: (0.983, 0.980)

Note that the results vary across trials because of the randomness in the batch selection and the stopping condition.

Problem 2: Kernel Ridge Regression [20 points]

General Ridge Regression

\[\hat{y} = X\theta\]

\[J(\theta) = \frac12(y-X\theta)^T(y-X\theta) + \frac{\lambda}2\|\theta\|^2\]

Let

\[\begin{aligned} & \nabla_{\theta}J(\theta) = 0 \\ \to \quad & \theta = (X^TX + \lambda I_d)^{-1}X^Ty \end{aligned}\]
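
A minimal NumPy sketch of this closed-form solution (an illustration only; `ridge_theta` is a hypothetical helper, and `np.linalg.solve` is used instead of forming the inverse explicitly):

import numpy as np

def ridge_theta(X, y, lam):
    """theta = (X^T X + lam * I_d)^{-1} X^T y, computed via a linear solve."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)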

Method 1: Approximate kernel with random features

When using a feature map, substituting \(X\) with \(\phi = \phi(X)\) gives us

\[\begin{aligned} \theta &= (\phi^T\phi + \lambda I_m)^{-1}\phi^Ty \end{aligned}\]

When predicting,

\[\begin{aligned} y_{new} &= \phi(X_{new}) (\phi^T\phi + \lambda I_m)^{-1}\phi^Ty \end{aligned}\]

The time complexity of computing \(\theta\) is \(O(\max{(mn^2,m^3,m^2n)})\), considering the matrix inverse and the matrix multiplications.
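
A sketch of Method 1, reusing the trigonometric random feature map from Problem 1 (this assumes unit-norm rows of \(X\) and that Kernel_FeatureMaps.py is importable; `rf_ridge_fit_predict` is a hypothetical helper):

import numpy as np
from Kernel_FeatureMaps import baseline_SM, generateGMatrix

def rf_ridge_fit_predict(X, y, X_new, num_features, lam):
    """Method 1: ridge regression on random features.
    theta = (phi^T phi + lam * I_m)^{-1} phi^T y, prediction = Phi(X_new) theta."""
    d = X.shape[1]
    G = generateGMatrix(num_features, d)       # the same G must be used for train and test
    phi = baseline_SM(X, G)                    # (n, m) with m = 2 * num_features
    m = phi.shape[1]
    theta = np.linalg.solve(phi.T @ phi + lam * np.eye(m), phi.T @ y)
    return baseline_SM(X_new, G) @ theta       # (n',)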

Method 2: Exact kernel

Using the push-through matrix identity \((\phi^T\phi + \lambda I_m)^{-1}\phi^T = \phi^T(\phi\phi^T + \lambda I_n)^{-1}\), we can move the computation from \(m\) dimensions to \(n\) dimensions.

\[\begin{aligned} \theta &= \phi^T(\phi\phi^T + \lambda I_n)^{-1}y \\ &= \phi^T\big(K + \lambda I_n\big)^{-1}y \end{aligned}\]

\(K = K(X,X)\) is the \(n\times n\) kernel matrix of the training data. This gives the closed-form solution for the weights.

When predicting, \[\begin{aligned} y_{new} &= \phi(X_{new})\phi(X)^T\big(K + \lambda I_n\big)^{-1}y \\ &= K(X_{new}, X) \big(K + \lambda I_n\big)^{-1}y \\ &= \kappa(X_{new}) \big(K + \lambda I_n\big)^{-1}y \end{aligned}\]

The time complexity of computing \(\theta\) is \(O(\max{(n^2d, n^3, mn^2)})\): directly computing the kernel matrix takes \(O(n^2d)\), the matrix inverse \(O(n^3)\), and the matrix multiplications \(O(mn^2)\) and \(O(mn)\).
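
A sketch of Method 2 with the exact softmax kernel from Problem 1 (again assuming unit-norm inputs; `kernel_ridge_fit_predict` is a hypothetical helper). Note that the linear solve is now \(n\)-dimensional rather than \(m\)-dimensional:

import numpy as np
from Kernel_FeatureMaps import Kernel

def kernel_ridge_fit_predict(X, y, X_new, lam):
    """Method 2: exact kernel ridge regression.
    prediction = K(X_new, X) (K + lam * I_n)^{-1} y."""
    n = X.shape[0]
    K = Kernel.Softmax(X, X)                         # (n, n) exact kernel matrix, O(n^2 d)
    dual = np.linalg.solve(K + lam * np.eye(n), y)   # O(n^3) solve in n dimensions
    return Kernel.Softmax(X_new, X) @ dual           # (n', n) @ (n,)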

Comparison: Time complexity of training

When \(m \gg n\), the approximate kernel method takes

\[O(\max{(mn^2,m^3,m^2n)}) = O(m^3)\]

while the exact kernel method takes

\[O(\max{(n^2d, n^3, mn^2)})= O(mn^2)\]

\[\because m^3 \gg mn^2\]

\[\therefore \text{the time complexity of the approximate kernel method is higher than that of the exact kernel method}\]
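
As a concrete illustration: with \(n = 10^2\) batch samples and \(m = 10^4\) random features, the approximate method costs on the order of \(m^3 = 10^{12}\) operations, while the exact method costs on the order of \(mn^2 = 10^8\).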