Data Mining Lec1

Course information: IEORE4540

Lecturer: Krzysztof Choromanski

  • Krzysztof Choromanski works on several aspects of machine learning and robotics. His current research interests include reinforcement learning and randomized methods such as nonlinear embeddings based on structured random feature maps and quasi-Monte-Carlo methods. He has also worked on online nonparametric clustering for massive high-dimensional streams. Krzysztof is an author of several nonlinear embedding mechanisms based on structured matrices that can be used to speed up neural network computations, kernel methods applying random feature maps, convex optimization solvers, and quantization and soft clustering methods, as well as several LSH-based algorithms. With his background in structural graph theory, he is also interested in applying graph theory and other combinatorial methods in machine learning.

Contact: KMC2178@columbia.edu|CHOROMANSKI1@gmail.com|KCHORO@google.com

Overview

The first lecture mainly covers several classes of kernel methods (kernels), random feature maps, and MSE.

Lecture notes

Supplement: Kernels and SVM

cs229 Lecture 12.4 — Support Vector Machines | (Kernels-I) — [ Machine Learning | Andrew Ng]

Kernels

A kernel is also known as a similarity function. We pick landmarks \(l^{(1)}, l^{(2)}, l^{(3)}\) and want to measure how similar an input \(x\) is to each landmark.

Gaussian kernel \(K_{Gauss}\) is

\[f_i = \mathrm{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x-l^{(i)}\|^2}{2\sigma^2}\right)\]

\(\sigma\) is a hyperparameter that controls the shape of the contour plot, i.e., how quickly the similarity decays with distance.

We then use the similarities \(f_i\) to predict \(y\).

Predict \(y=1\) if

\[\theta_0 + \theta_1f_1 + \theta_2f_2 + \theta_3f_3 \geq 0\]

Here, the \(\theta\)'s are weights. If we want the prediction to depend on whether \(x\) is close to landmarks \(l^{(1)}\) and \(l^{(2)}\), we could set \(\theta_1 = 1, \theta_2 = 1, \theta_3 = 0\).
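
As a quick illustration, here is a minimal numpy sketch of the prediction rule above; the landmarks and the \(\theta\) values are made up for the example, not taken from the lecture.

```python
# Minimal sketch of the landmark idea: compute Gaussian similarities f_i to
# three made-up landmarks and predict y = 1 when theta^T [1, f_1, f_2, f_3] >= 0.
import numpy as np

def gaussian_similarity(x, landmark, sigma=1.0):
    """f_i = exp(-||x - l^{(i)}||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

# Hypothetical landmarks and weights (theta_3 = 0 ignores the third landmark).
landmarks = [np.array([1.0, 1.0]), np.array([3.0, 3.0]), np.array([5.0, 0.0])]
theta = np.array([-0.5, 1.0, 1.0, 0.0])   # [theta_0, theta_1, theta_2, theta_3]

x = np.array([1.2, 0.9])                  # input close to l^{(1)}
f = np.array([1.0] + [gaussian_similarity(x, l) for l in landmarks])
print("predict y =", int(theta @ f >= 0))  # 1, since x is near l^{(1)}
```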

SVM with Kernels

Given \((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\)

we choose the landmarks at the training points: \(l^{(i)} = x^{(i)}\)

\[f_1 = \mathrm{similarity}(x, l^{(1)})\] \[f_2 = \mathrm{similarity}(x, l^{(2)})\] \[\vdots\] \[f_m = \mathrm{similarity}(x, l^{(m)})\]

Define \(f\)

\[f = \begin{bmatrix}f_0\\f_1\\f_2\\\vdots\\f_m\end{bmatrix}\] where \(f_0 = 1\)

For training example \((x^{(i)}, y^{(i)})\), \[f^{(i)} = \begin{bmatrix}f_0^{(i)}\\f_1^{(i)}\\f_2^{(i)}\\\vdots\\f_m^{(i)}\end{bmatrix}\] where \(f_j^{(i)} = \mathrm{similarity}(x^{(i)}, l^{(j)})\) and \(f_0^{(i)} = 1\).

\(f \in \mathbb{R}^{m+1}\), predict \(y=1\) if \(\theta^{T}f \geq 0\)

\(\theta \in \mathbb{R}^{m+1}\)

When training, the objective function is written in terms of \(f^{(i)}\) instead of \(x^{(i)}\):

\[\min_{\theta}\; C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^Tf^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^Tf^{(i)})\right] + \frac{1}{2} \sum_{j=1}^{m}\theta_j^2\]
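
A small numpy sketch of this construction, assuming the usual hinge-style forms for \(\mathrm{cost}_1\) and \(\mathrm{cost}_0\) (not spelled out above): each training point is mapped to \(f^{(i)} = [1, k(x^{(i)}, x^{(1)}), \ldots, k(x^{(i)}, x^{(m)})]\) and the objective is evaluated on these features.

```python
# Sketch: kernelized feature vectors f^{(i)} and the (assumed hinge-style) SVM objective.
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def feature_matrix(X, sigma=1.0):
    """Row i is f^{(i)} = [1, k(x_i, x_1), ..., k(x_i, x_m)] with landmarks l^{(j)} = x_j."""
    m = X.shape[0]
    K = np.array([[gaussian_kernel(X[i], X[j], sigma) for j in range(m)] for i in range(m)])
    return np.hstack([np.ones((m, 1)), K])

def svm_objective(theta, F, y, C=1.0):
    z = F @ theta
    cost1 = np.maximum(0, 1 - z)   # assumed cost_1, applied when y = 1
    cost0 = np.maximum(0, 1 + z)   # assumed cost_0, applied when y = 0
    return C * np.sum(y * cost1 + (1 - y) * cost0) + 0.5 * np.sum(theta[1:] ** 2)

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])  # toy data, m = 3
y = np.array([1, 1, 0])
F = feature_matrix(X)
print(svm_objective(np.zeros(F.shape[1]), F, y))    # objective at theta = 0
```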

Wiki: Kernel Method

Kernel methods use a kernel function to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space; instead, they only compute inner products between pairs of points in the feature space.

Instance-based: kernel methods are instance-based learners. Rather than learning a fixed number of parameters, they remember the \(i\)-th training example \((\mathbf {x}_{i},\mathbf {y}_{i})\) and learn a corresponding weight \(w_{i}\) for it.

\[\therefore w \in \mathbb{R}^m,\quad x \in \mathbb{R}^m\]

A summary of the various kernel functions encountered:

The PCA model in the original space is \[XX^Tw_i=\lambda_iw_i\] In the high-dimensional space it becomes \[\Phi(X)\Phi(X)^Tw_i^{\Phi} = \lambda_iw_i^{\Phi}\] The basis vector \(w_i^{\Phi}\) can be written as a linear combination of the training samples (as with the relation between \(w\) and \(x\) mentioned above, everything lives in the \(N\times d\)-dimensional data space): \[w_i^{\Phi} = \sum_{k=1}^{N}{\alpha_k\Phi(x_k)} = \Phi(X)\alpha_i \] Substituting this back in gives \[\Phi(X)\Phi(X)^T\Phi(X)\alpha_i = \lambda_i\Phi(X)\alpha_i\] and left-multiplying both sides by \(\Phi(X)^T\) gives \[\Phi(X)^T\Phi(X)\Phi(X)^T\Phi(X)\alpha_i = \lambda_i\Phi(X)^T\Phi(X)\alpha_i\] Suppose some kernel function satisfies \[ k(x_i, x_j) = \Phi(x_i)^T\Phi(x_j) \] or, in matrix form, \[ K = \Phi(X)^T\Phi(X) \] Then \[ K^2\alpha_i = \lambda_iK\alpha_i \] and the high-dimensional mapping has been eliminated.
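
The last step is easy to turn into code: \(K^2\alpha_i = \lambda_iK\alpha_i\) reduces to the eigenproblem \(K\alpha_i = \lambda_i\alpha_i\), so kernel PCA only needs the kernel matrix. The sketch below uses a Gaussian kernel and, like the derivation above, omits centering in feature space and the usual \(1/\sqrt{\lambda_i}\) normalization of \(\alpha_i\).

```python
# Kernel PCA sketch: build K_ij = k(x_i, x_j), solve K alpha = lambda alpha,
# and project the training points via K @ alpha (normalization omitted).
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # N = 50 samples in R^3
K = gaussian_kernel_matrix(X)           # K = Phi(X)^T Phi(X), never forming Phi

eigvals, eigvecs = np.linalg.eigh(K)    # eigenproblem K alpha = lambda alpha
alpha = eigvecs[:, ::-1]                # columns sorted by decreasing eigenvalue
projections = K @ alpha[:, :2]          # coordinates along the top-2 components
print(projections.shape)                # (50, 2)
```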

So what properties does a kernel function need to have?

Mercer's theorem

Mercer's theorem: any positive semi-definite function can be used as a kernel function. (This is a sufficient condition, not a necessary one.)

\[X = (x_1, x_2, \ldots, x_n)\]

We define a matrix with entries

\[a_{ij} = f(x_i,x_j)\]

If this \(n \times n\) matrix is positive semi-definite, then \(f(x_i,x_j)\) is called a positive semi-definite function.
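
This condition is easy to check numerically for a given kernel and a finite sample: build the \(n \times n\) matrix and look at its eigenvalues.

```python
# Build a_ij = k(x_i, x_j) for the Gaussian kernel on random points and check
# that all eigenvalues are (numerically) non-negative, i.e. the matrix is PSD.
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))            # n = 20 points in R^4
A = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(A)
print("min eigenvalue:", eigvals.min())
print("PSD:", bool(np.all(eigvals >= -1e-10)))   # tolerance for floating-point error
```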

Common Kernels

Linear Kernel

\[ k(x,y) = x^Ty \]

Gaussian Kernel

\[ k_{Gauss}(x,y) = \exp(-\frac{||x-y||^2}{2\sigma^2}) \]

Exponential Kernel

A variant of the Gaussian kernel in which the squared L2 distance is replaced by the L1 distance:

\[ k(x,y) = \exp(-\frac{||x-y||}{2\sigma^2}) \]

Laplacian Kernel

Equivalent to the exponential kernel (differing only in how \(\sigma\) enters the denominator):

\[ k(x,y) = \exp(-\frac{||x-y||}{\sigma}) \]

Softmax Kernel

\[ K_{SM}(x,y) = e^{x^Ty} = e^{\frac{\|x\|^2}{2}}\, e^{\frac{\|y\|^2}{2}}\, K_{Gauss}(x,y) \]

(taking \(\sigma = 1\) in \(K_{Gauss}\))
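
The identity is straightforward to verify numerically, since \(\|x-y\|^2 = \|x\|^2 + \|y\|^2 - 2x^Ty\):

```python
# Numerical check (sigma = 1): exp(x^T y) == exp(||x||^2/2) * exp(||y||^2/2) * exp(-||x-y||^2/2).
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

lhs = np.exp(x @ y)
rhs = np.exp(x @ x / 2) * np.exp(y @ y / 2) * np.exp(-np.sum((x - y) ** 2) / 2)
print(np.isclose(lhs, rhs))   # True
```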

Angular Kernel

\[ K_{Ang}(x,y) = 1 - \frac{2\theta_{x,y}}{\pi} \]

where \(\theta_{x,y}\) is the angle between \(x\) and \(y\).
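
One standard way to estimate this kernel (shown here as an illustration, not necessarily the construction used in the lecture) is with sign random features: for a Gaussian direction \(w\), \(\mathrm{sign}(w^Tx)\) and \(\mathrm{sign}(w^Ty)\) disagree with probability \(\theta_{x,y}/\pi\), so their product has expectation \(1 - 2\theta_{x,y}/\pi = K_{Ang}(x,y)\).

```python
# Monte-Carlo estimate of the angular kernel via sign random features.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

# Exact value: angle between x and y, then 1 - 2*theta/pi.
cos_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
exact = 1 - 2 * np.arccos(np.clip(cos_xy, -1.0, 1.0)) / np.pi

W = rng.normal(size=(100_000, 5))                     # random Gaussian directions w
estimate = np.mean(np.sign(W @ x) * np.sign(W @ y))   # E[sign(w^T x) sign(w^T y)]
print(exact, estimate)                                # the two values should be close
```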

Structure of the feature map \(\Phi\)

\[ \Phi(x) = \frac{1}{\sqrt{m}} \begin{bmatrix} f_1(w_1^Tx) & \ldots & f_1(w_m^Tx) \\ \vdots & & \vdots \\ f_l(w_1^Tx) & \ldots & f_l(w_m^Tx) \end{bmatrix}^T \]
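
A concrete instance of this structure (a sketch under the assumption that we target the Gaussian kernel): take \(l = 2\), \(f_1 = \cos\), \(f_2 = \sin\), and \(w_1, \ldots, w_m \sim \mathcal{N}(0, \sigma^{-2}I)\) i.i.d. Then \(\Phi(x)^T\Phi(y) \approx K_{Gauss}(x,y)\), which is the classic random Fourier feature map.

```python
# Random Fourier features: Phi(x) = (1/sqrt(m)) [cos(w_i^T x); sin(w_i^T x)]_{i=1..m},
# so that Phi(x)^T Phi(y) approximates the Gaussian kernel.
import numpy as np

def random_fourier_features(x, W):
    """Stack cos and sin of the m random projections, scaled by 1/sqrt(m)."""
    m = W.shape[0]
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, sigma = 4, 5000, 1.0
W = rng.normal(scale=1.0 / sigma, size=(m, d))   # rows are w_i ~ N(0, I / sigma^2)

x, y = rng.normal(size=d), rng.normal(size=d)
approx = random_fourier_features(x, W) @ random_fourier_features(y, W)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
print(exact, approx)   # the approximation should be close to the exact kernel value
```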