Data-Mining-Lec2

(09/20) Lemmas

Euler's formula

\[ e^{ix} = \cos{x} + i\sin{x} \]

Fourier transform

Reference: "How to understand the Fourier transform formula?" - 苗华栋's answer on 知乎 (Zhihu)

Orthogonality of trigonometric functions

\[ \int_{-\pi}^{\pi}{\sin{mx}\sin{nx}\, dx} = \begin{cases} 0, & m\neq n \\ \pi, & m = n \end{cases} \quad (m, n \text{ positive integers}) \]

Proof: apply the product-to-sum identities, then integrate over one period.
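
A quick numerical check of this relation (a minimal sketch using scipy.integrate.quad; trig_inner_product is just an illustrative name):

```python
import numpy as np
from scipy.integrate import quad

def trig_inner_product(m, n):
    # integrate sin(m x) * sin(n x) over one period [-pi, pi]
    value, _ = quad(lambda x: np.sin(m * x) * np.sin(n * x), -np.pi, np.pi)
    return value

print(trig_inner_product(2, 3))  # ~0   (m != n)
print(trig_inner_product(2, 2))  # ~pi  (m == n)
```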

Sines and cosines form an orthogonal basis: time domain \(\to\) frequency domain

\[ 1, \cos(x), \sin(x), \cos(2x), \sin(2x), \ldots, \cos(nx), \sin(nx) \]

Formula (for real-valued functions)

With fundamental frequency \(f\), the angular frequency is written as \(2 \pi f\):

\[ s_{N}(x) = \frac{a_0}{2} + \sum_{n = 1}^{N}{[a_n\cos{(2 \pi fnx)} + b_n\sin{(2 \pi fnx)}]} \]

\[a_{n} = A_{n}\sin{(\phi_{n})}, b_{n} = A_{n}\cos{(\phi_{n})}\]
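
One way to see where these relations come from is to expand the amplitude-phase form of each harmonic with the sine addition formula:

\[ A_n\sin{(2\pi fnx + \phi_n)} = A_n\sin{(\phi_n)}\cos{(2\pi fnx)} + A_n\cos{(\phi_n)}\sin{(2\pi fnx)} = a_n\cos{(2\pi fnx)} + b_n\sin{(2\pi fnx)} \]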

\(\frac{a_0}{2}\) corresponds to the DC component. Any continuous periodic signal can be built from a suitable combination of sinusoids. (A non-periodic signal is treated as having an infinite period.)

In general, orthogonality can be used to compute \(a_{n}\) and \(b_{n}\):

\[ a_{n} = 2f \int_{x_0}^{x_0 + \frac{1}{f}}{s(x)\cdot \cos{(2\pi fnx)}dx} \] \[ b_{n} = 2f \int_{x_0}^{x_0 + \frac{1}{f}}{s(x)\cdot \sin{(2\pi fnx)}dx} \]
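
As a sketch of how these integrals can be evaluated numerically, consider a test signal with known coefficients (the names s and fourier_coeffs below are illustrative, and \(f = 1\) is assumed):

```python
import numpy as np
from scipy.integrate import quad

f = 1.0  # fundamental frequency, so the period is 1/f = 1

def s(x):
    # test signal with known coefficients: b_1 = 2, a_2 = 0.5, all others 0
    return 2 * np.sin(2 * np.pi * f * x) + 0.5 * np.cos(4 * np.pi * f * x)

def fourier_coeffs(n, x0=0.0):
    # a_n and b_n via the orthogonality integrals over one period [x0, x0 + 1/f]
    a_n, _ = quad(lambda x: s(x) * np.cos(2 * np.pi * f * n * x), x0, x0 + 1 / f)
    b_n, _ = quad(lambda x: s(x) * np.sin(2 * np.pi * f * n * x), x0, x0 + 1 / f)
    return 2 * f * a_n, 2 * f * b_n

for n in range(1, 4):
    print(n, [round(c, 6) for c in fourier_coeffs(n)])
# expected: n=1 -> [0, 2], n=2 -> [0.5, 0], n=3 -> [0, 0]
```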

Shift-invariant Kernels

Random Fourier Features (RFF)

Reference: Random Features for Large-Scale Kernel Machines (Rahimi & Recht)

\[ k(x,y) = \langle\phi(x), \phi(y)\rangle \approx z(x)'z(y) \]

Gaussian kernel: \[ k(x,y) = e^{-\frac{||\Delta||^2}{2}}, \quad \Delta = x - y \] Shift-invariant: translating both inputs by the same vector (to \(x + z\) and \(y + z\)) leaves \(\Delta\) unchanged. This is the radial basis function (RBF) kernel.

The only thing that matters is the length of \(\Delta\).

Random feature map for the Gaussian kernel: \[ \Phi(x) = \frac{1}{\sqrt{m}} (\sin(w_1^Tx), \ldots, \sin(w_m^Tx), \cos(w_1^Tx), \ldots, \cos(w_m^Tx)) \]

Bochner's Theorem

A continuous kernel \(k(x, y) = f(x − y)\) on \(\mathbb{R}^d\) is positive definite if and only if \(f(\delta)\) is the Fourier transform of a non-negative measure. Since the kernel is real-valued, the imaginary part (the \(\sin\) term) can be ignored.

\[ K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} \] \[ K(x,y) = f(x-y) = f(z) = \int_{\mathbb{R}^d}{p(w)\cos(w^Tz)\,dw} = E_{w \sim \Omega}[\cos(w^Tz)]\] where \(z = x - y\) and \(p\) is the density of the probability distribution \(\Omega\) corresponding to \(K\).

For example, if \(K\) is the Gaussian kernel, then \(\Omega = N(0, I_d)\).

Since the kernel now has the form of an expectation, we can run simulations, i.e. Monte Carlo, to estimate shift-invariant kernels:

  • Choose the number of random samples \(m\).
  • Sample \(w_1, \ldots, w_m \overset{iid}{\sim} \Omega\).
  • Estimate \(K(x,y)\) by \(\hat{K}(x,y)\).
  • \(\hat{K}(x,y)\) is the sample mean, an unbiased estimator of the original kernel.

\[ \hat{K}(x,y) = \frac{1}{m} \sum_{i=1}^{m}{\cos(w_i^Tz)} = \frac{1}{m} \sum_{i=1}^m{X_i} \]

  • The \(X_i\) are i.i.d. and bounded, so strong concentration results apply.
  • Can we write \(\hat{K}(x,y) = \phi(x)^T\phi(y)\)? How can we disentangle \(x\) and \(y\) from \(z\)? Use the cosine difference identity.

\[ \hat{K}(x,y) = \frac{1}{m} \sum_{i=1}^{m}{\cos(w_i^Tx - w_i^Ty)} \] \[ = \frac{1}{m} \sum_{i=1}^{m}{[\cos(w_i^Tx)\cos(w_i^Ty) + \sin(w_i^Tx)\sin(w_i^Ty)]} \]

  • Disentangled!

  • Define (a code sketch follows after this list):

\[ \phi(x) = \frac{1}{\sqrt{m}} \begin{bmatrix}\cos(w_1^Tx)\\ \vdots \\\cos(w_m^Tx)\\\sin(w_1^Tx)\\ \vdots \\\sin(w_m^Tx) \end{bmatrix} \] \[ \hat{K}(x,y) = \phi(x)^T\phi(y) \]

  • Now the \(w_i\) can be drawn from distributions other than the Gaussian: \(w_i \overset{iid}{\sim} \Omega\).
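
A minimal NumPy sketch of this construction for the Gaussian-kernel case (rff_features and gaussian_kernel are illustrative names; the dimensions and test points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 2000                       # input dimension, number of random features
W = rng.standard_normal((m, d))       # w_1, ..., w_m iid from Omega = N(0, I_d)

def rff_features(x):
    # phi(x) = (1/sqrt(m)) [cos(w_i^T x); sin(w_i^T x)] stacked into one vector
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

def gaussian_kernel(x, y):
    return np.exp(-np.linalg.norm(x - y) ** 2 / 2)

x, y = 0.3 * rng.standard_normal(d), 0.3 * rng.standard_normal(d)
print(gaussian_kernel(x, y))              # exact kernel value
print(rff_features(x) @ rff_features(y))  # unbiased Monte Carlo estimate
```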

Attention Model - Softmax kernel and random feature map

Given a sequence, we want to measure the similarity between token embeddings and how they attend to each other. We use kernels to compute this attention.

This constructs a huge attention matrix of size (sequence length × sequence length).

The most popular similarity model is softmax attention.

Softmax kernel: \[ K_{SM}(x,y) = e^{x^Ty}\]

What is the relationship between the Gaussian kernel and the softmax kernel? It follows just by expanding \(||x-y||^2\): \[ K_{Gauss}(x,y) = K_{SM}(x,y) e^{-\frac{||x||^2}{2}} e^{-\frac{||y||^2}{2}} \]
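
Explicitly, with \(K_{Gauss}(x,y) = e^{-\frac{||x-y||^2}{2}}\):

\[ e^{-\frac{||x-y||^2}{2}} = e^{-\frac{||x||^2 - 2x^Ty + ||y||^2}{2}} = e^{-\frac{||x||^2}{2}} e^{x^Ty} e^{-\frac{||y||^2}{2}} = e^{-\frac{||x||^2}{2}} K_{SM}(x,y) e^{-\frac{||y||^2}{2}} \]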

Reference: Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers. This paper covers the relationship between the two kernels.

\[ SM(x,y) = e^{\frac{||x||^2}{2}}K_{Gauss}(x,y)e^{\frac{||y||^2}{2}} \]

From the above, \(K(x,y) = E[\hat{K}(x,y)]\), so \[ K_{Gauss}(x,y) = E[\phi(x)^T\phi(y)] \]

\[ \hat{SM}(x,y) = e^{\frac{||x||^2}{2}}\phi(x)^T\phi(y)e^{\frac{||y||^2}{2}} \] \[ = (e^{\frac{||x||^2}{2}}\phi(x))^T(e^{\frac{||y||^2}{2}}\phi(y)) \] \[ \phi_{SM}(x) = e^{\frac{||x||^2}{2}}\phi_{Gauss}(x) = e^{\frac{||x||^2}{2}} \frac{1}{\sqrt{m}} \begin{bmatrix}\cos(w_1^Tx)\\ \vdots \\\cos(w_m^Tx)\\\sin(w_1^Tx)\\ \vdots \\\sin(w_m^Tx) \end{bmatrix} \] \[ SM(x,y) = E[\hat{SM}(x,y)] = E[\phi_{SM}(x)^T\phi_{SM}(y)] \]
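
A small NumPy sketch of the softmax-kernel estimator built from the same trigonometric random features (phi_gauss and phi_sm are illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 2000
W = rng.standard_normal((m, d))   # w_i ~ N(0, I_d), as in the Gaussian RFF sketch

def phi_gauss(x):
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

def phi_sm(x):
    # phi_SM(x) = exp(||x||^2 / 2) * phi_Gauss(x)
    return np.exp(np.linalg.norm(x) ** 2 / 2) * phi_gauss(x)

x, y = 0.3 * rng.standard_normal(d), 0.3 * rng.standard_normal(d)
print(np.exp(x @ y))           # exact softmax kernel K_SM(x, y) = e^{x^T y}
print(phi_sm(x) @ phi_sm(y))   # random-feature estimate
```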

Orthogonal random features (ORF)

https://papers.nips.cc/paper/6246-orthogonal-random-features.pdf

Standard setting for constructing random features via Gaussian projections

IID: \(w_1, \ldots, w_m \overset{iid}{\sim} \Omega\), sampled independently from the canonical multivariate Gaussian distribution.

Constructing orthogonal random features

Definition

ORT: for variance reduction, \(w_1, \ldots, w_m\) are sampled in such a way that:

  • the marginal distributions are still the same as in the IID case, i.e. \(w_i \sim N(0,I_d)\)

  • different samples are exactly orthogonal by construction, i.e. \(w_i^Tw_j = 0\) for \(i\neq j\)

The samples are no longer independent: for i.i.d. samples \(w_i^Tw_j\) has mean 0 but nonzero variance, whereas here it is exactly 0. Given the previous rows, each new row is constrained to the subspace orthogonal to them.

How do we obtain this orthogonality?

  1. Create a Gaussian matrix \(G\in \mathbb{R}^{m\times d}\) (its rows will become the \(w_i\)) with entries taken independently at random from the scalar Gaussian distribution \(N(0,1)\).
  2. Gram-Schmidt orthogonalize \(G\). After this the rows of \(G\) are exactly orthogonal and of unit L2-norm.
  3. Renormalize each row of the resulting matrix by multiplying it by a random variable drawn from the \(\chi(d)\) distribution. The renormalizing variables for different rows can be the same or independent.
  4. Note: the procedure is feasible only when \(m \leq d\). (A code sketch follows below.)
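
A minimal NumPy sketch of this procedure (orthogonal_random_features is an illustrative name; np.linalg.qr stands in for the explicit Gram-Schmidt step, with a sign correction on the diagonal of R so the orthonormal frame keeps the intended distribution):

```python
import numpy as np

def orthogonal_random_features(m, d, rng):
    """Sample m mutually orthogonal random features in R^d (requires m <= d)."""
    assert m <= d, "the construction is only feasible for m <= d"
    G = rng.standard_normal((d, m))             # step 1: iid N(0, 1) entries
    Q, R = np.linalg.qr(G)                      # step 2: orthonormalize (in place of Gram-Schmidt)
    Q = Q * np.sign(np.diag(R))                 # sign fix so the frame is uniformly distributed
    norms = np.sqrt(rng.chisquare(d, size=m))   # step 3: chi(d)-distributed row norms
    return norms[:, None] * Q.T                 # rows: marginally N(0, I_d), exactly orthogonal

rng = np.random.default_rng(0)
W = orthogonal_random_features(8, 16, rng)
print(np.round(W @ W.T, 3))   # off-diagonal entries are 0: w_i^T w_j = 0 for i != j
```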

Gram-Schmidt orthogonalization

The Gram–Schmidt process is a method for orthonormalizing a set of vectors in an inner product space, most commonly the Euclidean space \(\mathbb{R}^n\) equipped with the standard inner product. The Gram–Schmidt process takes a finite, linearly independent set \(S = \{v_1, \ldots, v_k\}\) for \(k \leq n\) and generates an orthogonal set \(S' = \{u_1, \ldots, u_k\}\) that spans the same \(k\)-dimensional subspace of \(\mathbb{R}^n\) as \(S\).

Whenever we move to the next row, we subtract its projections onto the previous rows.
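
A bare-bones row-wise implementation (gram_schmidt_rows is an illustrative name; it assumes the rows are linearly independent):

```python
import numpy as np

def gram_schmidt_rows(G):
    """Orthonormalize the rows of G (assumed linearly independent)."""
    Q = np.zeros_like(G, dtype=float)
    for i, v in enumerate(G):
        u = np.array(v, dtype=float)
        for j in range(i):
            u -= (Q[j] @ v) * Q[j]       # subtract the projection onto each previous row
        Q[i] = u / np.linalg.norm(u)     # normalize to unit L2-norm
    return Q

Q = gram_schmidt_rows(np.random.default_rng(0).standard_normal((4, 6)))
print(np.round(Q @ Q.T, 6))              # identity matrix: rows are orthonormal
```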

Complexity analysis

D: number of random features, d: input dimensionality.

ORF is as expensive as RFF.

  • ORF:
    • generating the \(d\times d\) orthogonal matrix: \(O(d^3)\) time and \(O(d^2)\) space
    • computing the transformation: \(O(d^2)\) time and space

Structured ORF (SORF) ...to be continued