Data-Mining-HW3

HW3

Hua Yao, UNI:hy2632

Problem 1: SVM algorithm in action [50 points]

Description of the algorithm:

  1. Used dual SVM

  2. Did not use regularization, i.e., set \(C=\infty\), because this problem (binary classification between 0 and 9) should be separable, and the closed-form expression for \(b\) becomes messy under regularization.

  3. Used the SMO (sequential minimal optimization) algorithm. In each iteration, randomly select a pair \(\alpha_1, \alpha_2\), optimize the dual QP w.r.t. \(\alpha_2\), and update \(\alpha_1\) accordingly (a short derivation of the \(\alpha_1\) update is sketched right after this list). The constraint \(\alpha_2 \geq 0\) is imposed on the \(\alpha_2\) optimizer; this does not constrain \(\alpha_1 \geq 0\) directly, but with the random pair selection in each iteration, \(\alpha_i \geq 0\) holds for all \(i\) when the optimization over \(\alpha\) finally converges.

  4. Provided two options: a linear kernel (the baseline SVM) or the softmax kernel

    \[K_{SM}(x, y) = \exp(x^\top y)\]

    To avoid the kernel values exploding in scale, the input \(x\) is normalized to unit norm.

    Included the trigonometric feature map \(\phi(x)\) of the softmax kernel for computing \(b\) (not for making predictions with \(w\): at prediction time the kernel function is used instead of \(w^\top \phi\)).

    Used the exact kernel instead of approximating it with a random feature map, because the softmax kernel's feature space is infinite-dimensional and computing the exponential directly in NumPy is more efficient.

    The prediction is computed as follows (vectorized version):

    \[y_{new} = K(X_{new}, X)\cdot (\alpha * y) + b\]

    Note that \(b\) is broadcast over the \(n'\) new data points, \(K(X_{new}, X)\) is the \(n' \times n\) kernel matrix, and \(\alpha * y\) denotes the elementwise product of the two vectors.

  5. SVM training is expensive when \(n\) is large, so in practice we train on a small random batch (default size 64). The randomness of the batch selection affects the performance.

  6. Run a few trials and keep the model with the best accuracy on the training data; it should then also predict well on the validation data. The hyperparameters, such as batch_size and tol, may need a little tuning.

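A short sketch of the \(\alpha_1\) bookkeeping used in SMO (item 3 above; this is exactly what `update_alpha` in SVM.py does). Since \(y_i \in \{-1, +1\}\) and hence \(y_1^2 = 1\), the dual equality constraint forces

\[\sum_{i=1}^{n}\alpha_i y_i = 0 \quad\Rightarrow\quad \alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i\neq 1,2}\alpha_i y_i =: \zeta \quad\Rightarrow\quad \alpha_1 = (\zeta - \alpha_2 y_2)\,y_1\]

So after the one-dimensional QP over \(\alpha_2\) (with the bound \(\alpha_2 \geq 0\)), \(\alpha_1\) is recomputed from \(\zeta\) and the equality constraint keeps holding.
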
Classes: Kernel and SVM

Kernel_FeatureMaps.py

import numpy as np


def gram_schmidt_columns(X):
    Q, R = np.linalg.qr(X)
    return Q


def orthgonalize(V):
    N = V.shape[0]
    d = V.shape[1]
    turns = int(N / d)
    remainder = N % d

    V_ = np.zeros_like(V)

    for i in range(turns):
        v = gram_schmidt_columns(V[i * d:(i + 1) * d, :].T).T
        V_[i * d:(i + 1) * d, :] = v
    if remainder != 0:
        V_[turns * d:, :] = gram_schmidt_columns(V[turns * d:, :].T).T

    return V_


# Generate orthogonal normal weights (w1, ..., wm)
def generateGMatrix(m, d) -> np.ndarray:
    G = np.random.normal(0, 1, (m, d))
    # Renormalize: keep the row norms of the original Gaussian matrix
    norms = np.linalg.norm(G, axis=1).reshape([m, 1])
    return orthgonalize(G) * norms


# Softmax trigonometric feature map Φ(x)
def baseline_SM(x, G):
    """Calculate the softmax trigonometric random feature mapping.
    Parameters
    ----------
    x: array, dimension = n*d
        Input to the baseline mapping.
        Each row x_i is required to have norm 1, i = 1, ..., n
    G: matrix, dimension = m*d
        The matrix in the baseline random feature mapping
    """
    m = G.shape[0]
    left = np.cos(np.dot(x, G.T).astype(np.float32))
    right = np.sin(np.dot(x, G.T).astype(np.float32))
    return np.exp(0.5) * ((1 / m)**0.5) * np.hstack([left, right])


class Kernel(object):
    """Kernels.
    Parameters
    ---------
    x, y: arrays of shape (n_x, d), (n_y, d)
    Return
    ---------
    K(x, y): kernel matrix of shape (n_x, n_y), K_ij = K(x_i, y_j)
    """
    @staticmethod
    def Linear(x, y):
        return np.dot(x, y.T)

    @staticmethod
    def Softmax(x, y):
        return np.exp(np.dot(x, y.T))


class FeatureMap(object):
    """Feature mappings.

    Parameters
    ---------
    x: array of shape (n, d)
    Return
    ---------
    Φ(x): array of shape (n, m), where m is usually a higher dimensionality.
        Here m = 4*d: G has 2*d rows, and the cos/sin stacking doubles it again.
    """
    @staticmethod
    def Linear(x):
        return x

    @staticmethod
    def Softmax_Trigonometric(x):
        n, d = x.shape
        # G has 2*d rows, so Φ(x) has dimensionality 4*d
        G = generateGMatrix(2 * d, d)
        phi_x = baseline_SM(x, G)
        return phi_x

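A quick sanity check of the trigonometric feature map above (a minimal sketch, not part of the homework pipeline): on unit-norm inputs, \(\Phi(X)\Phi(X)^\top\) should approximate the exact softmax kernel \(\exp(XX^\top)\), with the error shrinking as the number of random features grows.

import numpy as np
from Kernel_FeatureMaps import Kernel, baseline_SM, generateGMatrix

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
x = x / np.linalg.norm(x, axis=1, keepdims=True)   # unit-norm rows, as baseline_SM assumes

K_exact = Kernel.Softmax(x, x)                     # exact kernel: exp(x_i^T x_j)
G = generateGMatrix(512, 16)                       # 512 block-orthogonal Gaussian rows
phi = baseline_SM(x, G)                            # (5, 1024) after cos/sin stacking
K_approx = phi @ phi.T

print(np.max(np.abs(K_exact - K_approx)))          # small, and shrinks as the number of rows of G grows
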
SVM.py

import numpy as np
from scipy.optimize import minimize
from tqdm.notebook import tqdm
from Kernel_FeatureMaps import *


class SVM(object):
    """
    ## Author:
        Hua Yao (hy2632@columbia.edu)
    ## Description:
        SVM binary classifier, optimizing the dual Lagrangian program on a sample batch with the SMO algorithm.
        Normalizes the input X to adapt to the kernelized version.
        Regularization parameter C is set to `np.inf` for an easy closed-form solution of b.
    ## Reference:
        [CS229 - Kernel Methods and SVM](http://cs229.stanford.edu/notes2020fall/notes2020fall/cs229-notes3.pdf)
    ---
    Parameters:
    ---------
    X: (N, d)
    Y: (N,)
    kernel: kernel used, linear/softmax
    featuremap: feature mapping corresponding to the kernel used
    batch_size: int, also denoted as n
    C: l1 regularization term for the soft margin. alpha_i in [0, C]. Set to np.inf (no regularization), because the form of b is messy under regularization.
    tol = 1e-6: tolerance, deciding when to end training
    Intermediate parameters:
    ---------
    x: (n, d), random batch of X
    y: (n,)
    phi_x: (n, m), feature map of x
    M: (n, n), M[i,j] = y_i y_j * K(x_i, x_j), elementwise (Hadamard) product of y y^T and K
    Learned parameters:
    ---------
    alpha: (n,)
    w: (m,), weight in feature space (m = d for the linear feature map)
    b: float
    """
    def __init__(self,
                 X,
                 Y,
                 kernel=Kernel.Linear,
                 featuremap=FeatureMap.Linear,
                 batch_size=64,
                 C=np.inf,
                 tol=1e-6):
        # C set as np.inf here -- no regularization

        # X: (N,d), Y: (N,), x: (n,d), y: (n,)
        # Fixed values
        self.N, self.d = X.shape
        # Normalize data
        self.X = X
        self.X = self.X / np.linalg.norm(self.X, axis=1, keepdims=True)
        self.Y = Y
        self.kernel = kernel
        self.featuremap = featuremap
        self.n = batch_size
        self.C = C
        self.tol = tol

        batch_indices = np.random.choice(np.arange(self.N), self.n)
        self.x = self.X[batch_indices]
        self.y = self.Y[batch_indices]
        self.phi_x = self.featuremap(self.x)
        self.M = np.outer(self.y, self.y) * self.kernel(self.x, self.x)

        # Learned parameters
        self.alpha = np.ones(self.n)
        self.w = np.zeros(self.d)
        self.b = 0

    def update_alpha(self, random_idx1, random_idx2):
        # Maintain the equality constraint sum_i alpha_i * y_i = 0 by solving for alpha_1
        Zeta = -np.sum(self.alpha * self.y) + (
            self.alpha * self.y)[random_idx1] + (self.alpha *
                                                 self.y)[random_idx2]
        self.alpha[random_idx1] = (Zeta - self.alpha[random_idx2] *
                                   self.y[random_idx2]) * self.y[random_idx1]

    def dual_obj(self, alpha):
        return np.sum(alpha) - np.sum(0.5 * self.M * np.outer(alpha, alpha))

    def fit(self, iterations=200000):
        prev_val = self.alpha.copy()

        for i in tqdm(range(iterations)):
            # Select 2 distinct alphas randomly
            random_idx1, random_idx2 = np.random.choice(
                np.arange(0, self.n), 2, replace=False)

            # The (quadratic w.r.t. a2) function that scipy.optimize.minimize takes
            def optimizeWRTa2(a2):
                self.alpha[random_idx2] = a2
                self.update_alpha(random_idx1, random_idx2)
                return -self.dual_obj(self.alpha)

            # Solve the optimization w.r.t. a2
            a2 = self.alpha[random_idx2]
            res = minimize(optimizeWRTa2, a2, bounds=[(0, self.C)])
            a2 = res.x[0]  # res.x is a length-1 array

            # Update a2 (and a1 through the equality constraint)
            self.alpha[random_idx2] = a2
            self.update_alpha(random_idx1, random_idx2)

            # Check convergence every 5 steps
            if (i % 5 == 1):
                if np.sum(np.abs(self.alpha - prev_val)) < self.tol:
                    print(
                        f">> Optimized on the batch, step {i}. 5 steps Δalpha:{np.sum(np.abs(self.alpha - prev_val))}"
                    )
                    break  # break (not return) so that w and b below are still computed
                else:
                    if (i % 5000 == 1):
                        print(
                            f">> Optimizing, step {i}. Δalpha:{np.sum(np.abs(self.alpha - prev_val))}"
                        )
                prev_val = self.alpha.copy()

        # Retrieve w and b
        self.w = np.dot(self.alpha * self.y, self.phi_x)
        # The form of b changes when there is a regularization parameter C, so we simply drop C here.
        # See p.25 of CS229 - Note 3.
        # The form of b under regularization depends on alpha (http://cs229.stanford.edu/materials/smo.pdf).
        # [Bias Term b in SVMs Again](https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2004-11.pdf) gives a general representation.
        self.b = (np.max(np.dot(self.phi_x, self.w)[self.y == -1]) +
                  np.min(np.dot(self.phi_x, self.w)[self.y == 1])) * -0.5

    def predict(self, X_val):
        X_val_normed = X_val / np.linalg.norm(X_val, axis=1, keepdims=True)
        return np.sign(
            np.dot(self.kernel(X_val_normed, self.x), self.alpha * self.y) +
            self.b)

    def score(self, X_val, y_val):
        prediction = self.predict(X_val)
        return np.mean(prediction == y_val)

Training Example: Jupyter notebook file

import numpy as np
from scipy.optimize import minimize
from tqdm.notebook import tqdm
import torch
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
%matplotlib inline

from SVM import SVM
from Kernel_FeatureMaps import *

"0"/"9" Data Extraction from MNIST

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
trainset = datasets.MNIST('PATH_TO_STORE_TRAINSET', download=True, train=True, transform=transform)
valset = datasets.MNIST('PATH_TO_STORE_TESTSET', download=True, train=False, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset)
valloader = torch.utils.data.DataLoader(valset)
ZeroNineTrain = [i for i in tqdm(trainset) if (i[1] == 0)|(i[1] == 9)]
X_train, y_train = zip(*ZeroNineTrain)
X_train = torch.stack(X_train)
y_train = torch.tensor(y_train)
ZeroNineVal = [i for i in tqdm(valset) if (i[1] == 0)|(i[1] == 9)]
X_val, y_val = zip(*ZeroNineVal)
X_val = torch.stack(X_val)
y_val = torch.tensor(y_val)
plt.imshow(X_train[0].squeeze(), cmap="gray_r")
plt.show()
y_train[0]
[image output: X_train[0], a handwritten "0"]
tensor(0)
torch.save(X_train, "X_train.pt")
torch.save(y_train, "y_train.pt")
torch.save(X_val, "X_val.pt")
torch.save(y_val, "y_val.pt")

Load Train & Val data and preprocess

# Load Data
X_train = torch.load("X_train.pt")
y_train = torch.load("y_train.pt")
X_val = torch.load("X_val.pt")
y_val = torch.load("y_val.pt")


# Criteria: Label0: +1, Label9: -1
y_train = (y_train==0)*2-1
y_val = (y_val==0)*2-1
# Formatting
X_train = X_train.squeeze().view((X_train.shape[0],-1)).numpy()
y_train = y_train.numpy()
X_val = X_val.squeeze().view((X_val.shape[0],-1)).numpy()
y_val = y_val.numpy()
svm_linear = SVM(X_train, y_train, Kernel.Linear, FeatureMap.Linear, 64, np.inf, 1e-4)
svm_linear.fit()
svm_linear.score(X_train, y_train), svm_linear.score(X_val, y_val)


>> Optimizing, step 1. Δalpha:6.0
>> Optimized on the batch, step 4446. 5 steps Δalpha:1.7763568394002505e-15

(0.9761623989218329, 0.9773755656108597)
svm_SM = SVM(X_train, y_train, Kernel.Softmax, FeatureMap.Softmax_Trigonometric, 64, np.inf, 1e-4)
svm_SM.fit()
svm_SM.score(X_train, y_train), svm_SM.score(X_val, y_val)


>> Optimizing, step 1. Δalpha:15.273472888060688
>> Optimized on the batch, step 251. 5 steps Δalpha:0.0

(0.9834905660377359, 0.9803921568627451)

Result:

Train Accuracy / Validation Accuracy:

  • Linear SVM: (0.976, 0.977)
  • Softmax Kernel SVM: (0.983, 0.980)

Note that the results vary across trials because of the randomness in the batch selection and the stopping condition.

Problem 2: Kernel Ridge Regression [20 points]

General Ridge Regression

\[\hat{y} = X\theta\]

\[J(\theta) = \frac12(y-X\theta)^T(y-X\theta) + \frac{\lambda}2\|\theta\|^2\]

Let

\[\begin{aligned} & \nabla_{\theta}J(\theta) = 0 \\ \to \quad & \theta = (X^TX + \lambda I_d)^{-1}X^Ty \end{aligned}\]
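
A minimal NumPy sketch of this closed-form solution (an illustration only; `ridge_theta` is a hypothetical helper, and `np.linalg.solve` is used instead of forming the inverse explicitly):

import numpy as np

def ridge_theta(X, y, lam):
    """theta = (X^T X + lam * I_d)^{-1} X^T y, computed via a linear solve."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)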

Method 1: Approximate kernel with random features

When using a feature map, substituting \(X\) with \(\phi = \phi(X)\) gives us

\[\begin{aligned} \theta &= (\phi^T\phi + \lambda I_m)^{-1}\phi^Ty \end{aligned}\]

When predicting,

\[\begin{aligned} y_{new} &= \phi(X_{new}) (\phi^T\phi + \lambda I_m)^{-1}\phi^Ty \end{aligned}\]

The time complexity of computing \(\theta\) is \(O(\max{(mn^2,m^3,m^2n)})\), considering the matrix inverse and the matrix multiplications.
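
A sketch of Method 1, reusing the trigonometric random feature map from Problem 1 (this assumes unit-norm rows of \(X\) and that Kernel_FeatureMaps.py is importable; `rf_ridge_fit_predict` is a hypothetical helper):

import numpy as np
from Kernel_FeatureMaps import baseline_SM, generateGMatrix

def rf_ridge_fit_predict(X, y, X_new, num_features, lam):
    """Method 1: ridge regression on random features.
    theta = (phi^T phi + lam * I_m)^{-1} phi^T y, prediction = Phi(X_new) theta."""
    d = X.shape[1]
    G = generateGMatrix(num_features, d)       # the same G must be used for train and test
    phi = baseline_SM(X, G)                    # (n, m) with m = 2 * num_features
    m = phi.shape[1]
    theta = np.linalg.solve(phi.T @ phi + lam * np.eye(m), phi.T @ y)
    return baseline_SM(X_new, G) @ theta       # (n',)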

Method 2: Exact kernel

Using the push-through matrix identity \((\phi^T\phi + \lambda I_m)^{-1}\phi^T = \phi^T(\phi\phi^T + \lambda I_n)^{-1}\), we can move the computation from \(m\) dimensions to \(n\) dimensions.

\[\begin{aligned} \theta &= \phi^T(\phi\phi^T + \lambda I_n)^{-1}y \\ &= \phi^T\big(K + \lambda I_n\big)^{-1}y \end{aligned}\]

\(K = K(X,X)\) is the \(n\times n\) kernel matrix of the training data. This gives the closed-form solution for the weights.

When predicting, \[\begin{aligned} y_{new} &= \phi(X_{new})\phi(X)^T\big(K + \lambda I_n\big)^{-1}y \\ &= K(X_{new}, X) \big(K + \lambda I_n\big)^{-1}y \\ &= \kappa(X_{new}) \big(K + \lambda I_n\big)^{-1}y \end{aligned}\]

The time complexity of computing \(\theta\) is \(O(\max{(n^2d, n^3, mn^2)})\): directly computing the kernel matrix takes \(O(n^2d)\), the matrix inverse \(O(n^3)\), and the matrix multiplications \(O(mn^2)\) and \(O(mn)\).
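
A sketch of Method 2 with the exact softmax kernel from Problem 1 (again assuming unit-norm inputs; `kernel_ridge_fit_predict` is a hypothetical helper). Note that the linear solve is now \(n\)-dimensional rather than \(m\)-dimensional:

import numpy as np
from Kernel_FeatureMaps import Kernel

def kernel_ridge_fit_predict(X, y, X_new, lam):
    """Method 2: exact kernel ridge regression.
    prediction = K(X_new, X) (K + lam * I_n)^{-1} y."""
    n = X.shape[0]
    K = Kernel.Softmax(X, X)                         # (n, n) exact kernel matrix, O(n^2 d)
    dual = np.linalg.solve(K + lam * np.eye(n), y)   # O(n^3) solve in n dimensions
    return Kernel.Softmax(X_new, X) @ dual           # (n', n) @ (n,)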

Comparison: Time complexity of training

When \(m \gg n\), the approximate kernel method takes

\[O(\max{(mn^2,m^3,m^2n)}) = O(m^3)\]

while the exact kernel method takes

\[O(\max{(n^2d, n^3, mn^2)})= O(mn^2)\]

\[\because m^3 \gg mn^2\]

\[\therefore \text{the time complexity of the approximate kernel method is higher than that of the exact kernel method}\]
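
As a concrete illustration: with \(n = 10^2\) batch samples and \(m = 10^4\) random features, the approximate method costs on the order of \(m^3 = 10^{12}\) operations, while the exact method costs on the order of \(mn^2 = 10^8\).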