Did not consider regularization, i.e., set \(C=\infty\): this binary classification problem (digit 0 vs. digit 9) should be separable, and the closed-form expression for \(b\) becomes nasty once regularization is added.
Used the SMO (sequential minimal optimization) algorithm. Within each iteration, randomly select a pair \(\alpha_1, \alpha_2\), solve the QP w.r.t. \(\alpha_2\), and update \(\alpha_1\) accordingly so that the equality constraint \(\sum_i \alpha_i y_i = 0\) still holds. The constraint \(\alpha_2 \geq 0\) is enforced in the \(\alpha_2\) optimizer; \(\alpha_1 \geq 0\) is not constrained directly, but with the random pair selection in each iteration, \(\alpha_i \geq 0\) is satisfied for all \(i\) once the whole optimization over \(\alpha\) converges.
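For reference, here is a minimal sketch of one such pairwise update, written with the standard closed-form solution of the 1-D subproblem rather than a numerical optimizer (names and structure are illustrative, not the repository's exact implementation):

```python
import numpy as np

def smo_pair_update(alpha, y, K, i, j):
    """One illustrative hard-margin (C = inf) SMO update for the pair (i, j).

    alpha: (n,) dual variables, y: (n,) labels in {-1, +1}, K: (n, n) kernel matrix.
    """
    if i == j:
        return alpha
    # Errors E_k = f(x_k) - y_k; the bias b cancels in E_i - E_j, so it is omitted here
    f = K @ (alpha * y)
    E_i, E_j = f[i] - y[i], f[j] - y[j]
    # Curvature of the 1-D objective along the direction that keeps sum_k alpha_k y_k fixed
    eta = K[i, i] + K[j, j] - 2 * K[i, j]
    if eta <= 0:
        return alpha  # skip degenerate pairs
    # Unconstrained optimum for alpha_j, then clip so both alphas stay >= 0
    alpha_j_new = alpha[j] + y[j] * (E_i - E_j) / eta
    if y[i] == y[j]:
        lo, hi = 0.0, alpha[i] + alpha[j]                # alpha_i + alpha_j is preserved
    else:
        lo, hi = max(0.0, alpha[j] - alpha[i]), np.inf   # alpha_j - alpha_i is preserved
    alpha_j_new = float(np.clip(alpha_j_new, lo, hi))
    # Update alpha_i so the equality constraint sum_k alpha_k y_k = 0 still holds
    alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)
    out = alpha.copy()
    out[i], out[j] = alpha_i_new, alpha_j_new
    return out
```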
Provides two options: a linear kernel (baseline SVM) or the softmax kernel
\[K_{SM}(x, y) = \exp(x^\top y)\]
To avoid the kernel values exploding in scale, the input \(x\) is normalized to unit norm.
Included the trigonometric feature map \(\phi(x)\) of the softmax kernel for calculating \(b\) (not for \(w\), because at prediction time we use the kernel function rather than \(w^\top \phi\)).
Used the exact kernel instead of approximating it with a random feature map, because the softmax kernel's exact feature space is infinite-dimensional; directly computing the exponential in NumPy is also more efficient.
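Concretely (a standalone sketch, not necessarily matching the `Kernel_FeatureMaps` interface), the exact softmax kernel on row-normalized inputs is just an exponentiated Gram matrix:

```python
import numpy as np

def softmax_kernel(X1, X2):
    """Exact softmax kernel K[i, j] = exp(x1_i^T x2_j), with rows normalized to unit norm."""
    X1 = X1 / np.linalg.norm(X1, axis=1, keepdims=True)
    X2 = X2 / np.linalg.norm(X2, axis=1, keepdims=True)
    return np.exp(X1 @ X2.T)
```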
The prediction is computed as follows (vectorized version):
\[y_{new} = K(X_{new}, X)\cdot (\alpha * y) + b\]
Note that \(b\) is broadcast to the \(n'\) new data points, \(K(X_{new}, X)\) is the \(n' \times n\) kernel matrix, and \(\alpha * y\) denotes the elementwise product of the two vectors.
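A vectorized sketch of this prediction step, assuming a kernel function like the `softmax_kernel` helper above (names are illustrative):

```python
import numpy as np

def predict(X_new, X_train, y_train, alpha, b, kernel):
    """Decision values and labels for n' new points, given trained (alpha, b)."""
    K_new = kernel(X_new, X_train)          # (n', n) kernel matrix
    scores = K_new @ (alpha * y_train) + b  # b is broadcast over the n' points
    return np.sign(scores)                  # predicted labels in {-1, +1}
```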
SVM training is expensive when \(n\) is large, so in practice we train on a small random batch (default = 64). The randomness of this batch influences the performance.
Run a few trials and keep the model with the best prediction on the training data; it should then also predict well on the validation data. You might also need to tune the hyperparameters a little, e.g. `batch_size` and `tol`.
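A hedged sketch of such a trial loop, using the `SVM` class from `SVM.py` (shown below); the `train` and `predict` method names are hypothetical and may differ from the actual interface:

```python
import numpy as np

best_model, best_acc = None, -np.inf
for trial in range(5):
    model = SVM(X, Y, batch_size=64, tol=1e-6)  # constructor signature as in SVM.py
    model.train()                               # hypothetical method name
    acc = np.mean(model.predict(X) == Y)        # hypothetical method name; training accuracy
    if acc > best_acc:
        best_model, best_acc = model, acc
```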
```python
# Softmax trigonometric feature map Φ(x)
def baseline_SM(x, G):
    """Calculate the result of the softmax trigonometric random feature mapping.

    Parameters
    ----------
    x : array, dimension = n*d
        Input to the baseline mapping. Each row is required to have norm 1.
    G : matrix, dimension = m*d
        The matrix in the baseline random feature mapping.
    """
    m = G.shape[0]
    left = np.cos(np.dot(x, G.T).astype(np.float32))
    right = np.sin(np.dot(x, G.T).astype(np.float32))
    return np.exp(0.5) * ((1 / m)**0.5) * np.hstack([left, right])
```
```python
class FeatureMap(object):
    """Feature mapping

    Parameters
    ---------
        x: array of (n, d)

    Return
    ---------
        Φ(x): array of (n, m), where m is usually a higher dimensionality.
              Here we set m = 2*n
    """
    @staticmethod
    def Linear(x):
        return x

    @staticmethod
    def Softmax_Trigonometric(x):
        n, d = x.shape
        # Increase dimensionality to 2x
        G = generateGMatrix(2 * d, d)
        phi_x = baseline_SM(x, G)
        return phi_x
```
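As a sanity check, the inner products of the trigonometric features should approximate the exact softmax kernel on unit-norm inputs. The sketch below assumes `generateGMatrix(m, d)` returns an \(m \times d\) matrix with i.i.d. standard Gaussian entries, so it draws such a matrix directly with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-norm rows, as required

# Assumed behaviour of generateGMatrix: i.i.d. N(0, 1) entries of shape (m, d)
G = rng.standard_normal((4096, d))

phi_x = baseline_SM(x, G)           # (n, 2*m) trigonometric features
K_approx = phi_x @ phi_x.T          # random-feature approximation
K_exact = np.exp(x @ x.T)           # exact softmax kernel

print(np.max(np.abs(K_approx - K_exact)))  # small, shrinking as m grows
```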
```python
import numpy as np
from scipy.optimize import minimize
from tqdm.notebook import tqdm

from Kernel_FeatureMaps import *
```
```python
class SVM(object):
    """
    ## Author: Hua Yao (hy2632@columbia.edu)
    ## Description:
        SVM binary classifier, optimized via the dual Lagrangian program and trained on a sample batch.
        Uses the SMO algorithm.
        Normalizes input X to adapt to the kernelized version.
        Regularization parameter C set to `np.inf` for an easy closed-form solution of b.
    ## Reference: [CS229 - Kernel Methods and SVM](http://cs229.stanford.edu/notes2020fall/notes2020fall/cs229-notes3.pdf)

    ---
    Parameters:
    ---------
        X: (N, d)
        Y: (N,)
        kernel: kernel used, linear/softmax
        featuremap: feature mapping corresponding to the kernel used
        batch_size: int, also denoted as n
        C: l1 regularization term for the soft margin, alpha_i in [0, C].
           Set to np.inf (no regularization), because the form of b is nasty under regularization.
        tol: tolerance (default 1e-6), deciding when to end training

    Intermediate parameters:
    ---------
        x: (n, d), random batch of X
        y: (n,)
        phi_x: (n, m), feature map of x
        M: (n, n), M[i, j] = y_i * y_j * K(x_i, x_j), the Hadamard product of y y^T and K

    Learned parameters:
    ---------
        alpha: (n,)
        w: (m,), weight vector in feature space ((d,) for the linear feature map)
        b: float
    """
    def __init__(self,
                 X,
                 Y,
                 kernel=Kernel.Linear,
                 featuremap=FeatureMap.Linear,
                 batch_size=64,
                 C=np.inf,
                 tol=1e-6):
        # C set as np.inf here -- no regularization

        # X: (N, d), Y: (N,), x: (n, d), y: (n,)
        # Fixed values
        self.N, self.d = X.shape
        # Normalize data
        self.X = X
        self.X = self.X / np.linalg.norm(self.X, axis=1, keepdims=True)
        self.Y = Y
        self.kernel = kernel
        self.featuremap = featuremap
        self.n = batch_size
        self.C = C
        self.tol = tol
```
```python
        # Check convergence (inside the SMO training loop; `i` is the iteration index)
        if (i % 5 == 1):
            if np.sum(np.abs(self.alpha - prev_val)) < self.tol:
                print(
                    f">> Optimized on the batch, step {i}. 5 steps Δalpha: {np.sum(np.abs(self.alpha - prev_val))}"
                )
                return
            else:
                if (i % 5000 == 1):
                    print(
                        f">> Optimizing, step {i}. Δalpha: {np.sum(np.abs(self.alpha - prev_val))}"
                    )
                prev_val = self.alpha.copy()
```
```python
        # Retrieve w and b
        self.w = np.dot(self.alpha * self.y, self.phi_x)
        # The form of b changes when there exists regularization C, so we simply cancel C here.
        # Look at p.25 of CS229 - Note 3.
        # The form of b under regularization depends on alpha (http://cs229.stanford.edu/materials/smo.pdf).
        # [Bias Term b in SVMs Again](https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2004-11.pdf)
        # gives a general representation.
        self.b = (np.max(np.dot(self.phi_x, self.w)[self.y == -1]) +
                  np.min(np.dot(self.phi_x, self.w)[self.y == 1])) * -0.5
```
```python
import numpy as np
from scipy.optimize import minimize
from tqdm.notebook import tqdm
import torch
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
%matplotlib inline

from SVM import SVM
from Kernel_FeatureMaps import *
```
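For context, a hedged sketch of how the 0-vs-9 training data might be prepared with these imports (the notebook's actual preprocessing may differ; the +1/-1 label assignment is an assumption):

```python
# Hypothetical preprocessing sketch: load MNIST, keep digits 0 and 9,
# flatten to 784-dim vectors, and relabel as +1 / -1 for the binary SVM.
mnist = datasets.MNIST("./data", train=True, download=True)
images = mnist.data.numpy().reshape(-1, 784).astype(np.float32)
labels = mnist.targets.numpy()

mask = (labels == 0) | (labels == 9)
X = images[mask]                                 # raw pixels; the SVM class normalizes rows itself
Y = np.where(labels[mask] == 0, 1, -1)           # assumed label convention: 0 -> +1, 9 -> -1
```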
The time complexity of computing \(\theta\) is \(O(\max{(n^2d, n^3, mn^2)})\): directly computing the kernel matrix takes \(O(n^2d)\), the matrix inverse \(O(n^3)\), and the matrix multiplications \(O(mn^2)\) and \(O(mn)\).
Comparison: Time complexity of training
When \(m \gg n\), the approximate kernel method takes
\[O(\max{(mn^2,m^3,m^2n)}) = O(m^3)\]
while the exact kernel method takes
\[O(\max{(n^2d, n^3, mn^2)})= O(mn^2)\]
\[\because m^3 \gg mn^2\]
\[\therefore \text{the time complexity of the approximate kernel method is higher than that of the exact kernel method}\]
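For a concrete illustration (assuming, e.g., MNIST-sized inputs with \(n = 64\), \(d = 784\), and feature dimension \(m = 1568\)):

\[O(m^3) \approx 3.9 \times 10^9 \quad \text{vs.} \quad O(mn^2) \approx 6.4 \times 10^6,\]

so in this setting the approximate kernel method does roughly 600 times more work than the exact one.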