(09/20) Lemmas
Euler's formula
\[ e^{ix} = \cos{x} + i\sin{x} \]
Fourier transform
Orthogonality of trigonometric functions
\[ \int_{-\pi}^{\pi}{\sin(mx)\sin(nx)\, dx} = \begin{cases} 0, & m\neq n \\ \pi, & m = n \end{cases} \quad (m, n \geq 1) \]
Proof: apply the product-to-sum identities, then integrate over one period.
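Sketch of that step: by product-to-sum,
\[ \sin(mx)\sin(nx) = \frac{1}{2}\left[\cos((m-n)x) - \cos((m+n)x)\right] \]
and \(\int_{-\pi}^{\pi}{\cos(kx)\,dx} = 0\) for any nonzero integer \(k\). For \(m \neq n\) both terms vanish; for \(m = n\) the first term is \(\cos(0) = 1\), which integrates to \(2\pi\), giving \(\frac{1}{2}\cdot 2\pi = \pi\).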
Sines and cosines form an orthogonal basis, taking us from the time domain to the frequency domain:
\[ 1, \cos(x), \sin(x), \cos(2x), \sin(2x), \dots, \cos(nx), \sin(nx) \]
The Fourier series formula (for real-valued functions), with the angular frequency written as \(2\pi f\):
\[ s_{N}(x) = \frac{a_0}{2} + \sum_{n = 1}^{\infty}{[a_n\cos{(2 \pi fnx)} + b_n\sin{(2 \pi fnx)}]} \]
where, in amplitude-phase form with amplitude \(A_n\) and phase \(\phi_n\), \[a_{n} = A_{n}\sin{(\phi_{n})}, \quad b_{n} = A_{n}\cos{(\phi_{n})}\]
\(\frac{a_0}{2}\) is the DC component. Any continuous periodic signal can be composed of a suitable combination of sinusoids (a non-periodic signal is treated as having an infinite period).
In general, the coefficients \(a_{n}\) and \(b_{n}\) can be computed using orthogonality:
\[ a_{n} = 2f \int_{x_0}^{x_0 + \frac{1}{f}}{s(x)\cdot \cos{(2\pi fnx)}dx} \] \[ b_{n} = 2f \int_{x_0}^{x_0 + \frac{1}{f}}{s(x)\cdot \sin{(2\pi fnx)}dx} \]
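As a quick numerical check of these integrals, here is a minimal Python sketch (the square-wave signal, `N = 15`, and trapezoidal integration are illustrative choices, not from the notes):

```python
import numpy as np

# Example: approximate a square wave of frequency f by a truncated Fourier series.
f = 1.0                    # fundamental frequency
T = 1.0 / f                # period
x = np.linspace(0.0, T, 4001)
s = np.sign(np.sin(2 * np.pi * f * x))       # the periodic signal s(x)

def coeff(n):
    """a_n and b_n via the orthogonality integrals over one period."""
    a_n = 2 * f * np.trapz(s * np.cos(2 * np.pi * f * n * x), x)
    b_n = 2 * f * np.trapz(s * np.sin(2 * np.pi * f * n * x), x)
    return a_n, b_n

N = 15
a0, _ = coeff(0)
s_N = np.full_like(x, a0 / 2)                # DC component a_0 / 2
for n in range(1, N + 1):
    a_n, b_n = coeff(n)
    s_N += a_n * np.cos(2 * np.pi * f * n * x) + b_n * np.sin(2 * np.pi * f * n * x)

print(np.max(np.abs(s_N - s)))               # error concentrates at the jumps (Gibbs phenomenon)
```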
Shift-invariant Kernels
Random Fourier Features (RFF)
Paper: Random Features for Large-Scale Kernel Machines (Rahimi & Recht, 2007)
\[ k(x,y) = \langle \phi(x), \phi(y) \rangle \approx z(x)^Tz(y) \]
Example: the Gaussian, i.e. radial basis function (RBF), kernel \[ k(x, y) = e^{-\frac{||\Delta||^2}{2}}, \quad \Delta = x - y \] It is shift-invariant: shifting both inputs to \(x + z\) and \(y + z\) leaves \(\Delta\) unchanged. The only thing that matters is the length of \(\Delta\).
Random feature map for the Gaussian kernel: \[ \Phi(x) = \frac{1}{\sqrt{m}} (\sin(w_1^Tx), \dots, \sin(w_m^Tx), \cos(w_1^Tx), \dots, \cos(w_m^Tx)) \]
Bochner's Theorem
A continuous kernel \(k(x, y) = f(x - y)\) on \(\mathbb{R}^d\) is positive definite if and only if \(k(\delta)\) is the Fourier transform of a non-negative measure. Since the kernel is real-valued, the imaginary (\(\sin\)) part can be ignored.
\[ K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} \] \[ K(x,y) = f(x-y) = f(z) = \int_{\mathbb{R}^d}{p(w)\cos(w^Tz)\,dw} = E_{w \sim \Omega}[\cos(w^Tz)], \] where \(p\) is the probability distribution corresponding to \(K\).
For example, if \(K\) is the Gaussian kernel then \(\Omega = N(0, I_d)\).
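A quick check of this example: if \(w \sim N(0, I_d)\), then \(w^Tz \sim N(0, ||z||^2)\), and the characteristic function of a Gaussian gives
\[ E_{w \sim N(0, I_d)}[\cos(w^Tz)] = \mathrm{Re}\, E[e^{iw^Tz}] = e^{-\frac{||z||^2}{2}}, \]
which is exactly the Gaussian kernel \(e^{-\frac{||x-y||^2}{2}}\).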
Since the kernel is written as an expectation, we can estimate shift-invariant kernels by simulation, i.e. Monte Carlo:
- Choose the number of random samples \(m\).
- Sample \(w_1, \dots, w_m \overset{iid}{\sim} \Omega\).
- Estimate \(K(x,y)\) by \(\hat{K}(x,y)\).
- \(\hat{K}(x,y)\) is the sample mean, an unbiased estimator of the original kernel.
\[ \hat{K}(x,y) = \frac{1}{m} \sum_{i=1}^{m}{\cos(w_i^Tz)} = \frac{1}{m} \sum_{i=1}^m{X_i} \]
- The \(X_i\) are i.i.d. and bounded, which gives strong concentration results.
- Can we write \(\hat{K}(x,y) = \phi(x)^T\phi(y)\)? How do we disentangle \(x\) and \(y\) from \(z = x - y\)? Use the cosine difference identity:
\[ \hat{K}(x,y) = \frac{1}{m} \sum_{i=1}^{m}{\cos(w_i^Tx - w_i^Ty)} \] \[ = \frac{1}{m} \sum_{i=1}^{m}{[\cos(w_i^Tx)\cos(w_i^Ty) + \sin(w_i^Tx)\sin(w_i^Ty)]} \]
Disentangled!
Define
\[ \phi(x) = \frac{1}{\sqrt{m}} \begin{bmatrix}\cos(w_1^Tx)\\ \vdots \\\cos(w_m^Tx)\\\sin(w_1^Tx)\\ \vdots \\\sin(w_m^Tx) \end{bmatrix} \] \[ \hat{K}(x,y) = \phi(x)^T\phi(y) \]
- More generally, the \(w_i\) can be drawn from distributions other than the Gaussian: \(w_i \overset{iid}{\sim} \Omega\).
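A minimal NumPy sketch of the whole RFF construction for the Gaussian kernel (the data sizes, `m = 2000`, and the helper name `rff_features` are illustrative choices):

```python
import numpy as np

def rff_features(X, W):
    """Random Fourier feature map: phi(x) = 1/sqrt(m) [cos(Wx); sin(Wx)]."""
    m = W.shape[0]
    proj = X @ W.T                                   # (n, m) projections w_i^T x
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, n = 5, 2000, 100
X = rng.normal(size=(n, d))
W = rng.normal(size=(m, d))                          # w_i ~ iid N(0, I_d) for the Gaussian kernel

Phi = rff_features(X, W)
K_hat = Phi @ Phi.T                                  # hat{K}(x, y) = phi(x)^T phi(y)

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_true = np.exp(-sq_dists / 2)                       # exact Gaussian kernel e^{-||x - y||^2 / 2}

print(np.max(np.abs(K_hat - K_true)))                # Monte Carlo error, shrinks like 1/sqrt(m)
```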
Attention Model - Softmax kernel and random feature map
Given a sequence, we want to measure the similarity between token embeddings, i.e. how they attend to each other. Kernels are used to compute this attention.
This constructs a huge matrix (len \(\times\) len).
The most popular similarity model is softmax attention.
Softmax kernel: \[ K_{SM}(x,y) = e^{x^Ty}\]
What is the relationship between the Gaussian kernel and the softmax kernel? Just expand \(||x-y||^2 = ||x||^2 - 2x^Ty + ||y||^2\): \[ K_{Gauss}(x,y) = K_{SM}(x,y) e^{-\frac{||x||^2}{2}} e^{-\frac{||y||^2}{2}} \]
The paper Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers covers the relationship between the two kernels.
\[ SM(x,y) = e^{\frac{||x||^2}{2}}K_{Gauss}(x,y)e^{\frac{||y||^2}{2}} \]
From the above, \(K(x,y) = E[\hat{K}(x,y)]\), so \[ K_{Gauss}(x,y) = E[\phi(x)^T\phi(y)] \]
\[ \hat{SM}(x,y) = e^{\frac{||x||^2}{2}}\phi(x)^T\phi(y)e^{\frac{||y||^2}{2}} \] \[ = (e^{\frac{||x||^2}{2}}\phi(x))^T(e^{\frac{||y||^2}{2}}\phi(y)) \] \[ \phi_{SM}(x) = e^{\frac{||x||^2}{2}}\phi_{Gauss}(x) = e^{\frac{||x||^2}{2}} \frac{1}{\sqrt{m}} \begin{bmatrix}\cos(w_1^Tx)\\ \vdots \\\cos(w_m^Tx)\\\sin(w_1^Tx)\\ \vdots \\\sin(w_m^Tx) \end{bmatrix} \] \[ SM(x,y) = E[\hat{SM}(x,y)] = E[\phi_{SM}(x)^T\phi_{SM}(y)] \]
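Continuing in the same spirit, a minimal NumPy sketch of the softmax random feature map (the helper name `softmax_rf`, the input scaling, and `m = 5000` are illustrative choices; small-norm inputs keep the \(e^{\frac{||x||^2}{2}}\) prefactor well behaved):

```python
import numpy as np

def softmax_rf(X, W):
    """phi_SM(x) = e^{||x||^2 / 2} * 1/sqrt(m) [cos(Wx); sin(Wx)]."""
    m = W.shape[0]
    proj = X @ W.T
    phi_gauss = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)
    return np.exp(np.sum(X ** 2, axis=1, keepdims=True) / 2) * phi_gauss

rng = np.random.default_rng(0)
d, m, n = 5, 5000, 50
X = 0.3 * rng.normal(size=(n, d))          # token embeddings (kept small-norm on purpose)
W = rng.normal(size=(m, d))                # w_i ~ iid N(0, I_d)

Phi = softmax_rf(X, W)
SM_hat = Phi @ Phi.T                       # estimator of the softmax kernel
SM_true = np.exp(X @ X.T)                  # exact K_SM(x, y) = e^{x^T y}

print(np.max(np.abs(SM_hat - SM_true)))    # shrinks as m grows
```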
Orthogonal random features (ORF)
https://papers.nips.cc/paper/6246-orthogonal-random-features.pdf
Standard setting for constructing random features via Gaussian projections
IID: \(w_1, \dots, w_m \overset{iid}{\sim} \Omega\), sampled independently from the canonical multivariate Gaussian distribution.
Constructing orthogonal random features
Definition
ORT: for variance reduction, \(w_1, \dots, w_m\) are sampled so that:
- the marginal distributions are still the same as in the IID case, i.e. \(w_i \sim N(0,I_d)\);
- different samples are conditioned to be exactly orthogonal, i.e. \(w_i^Tw_j = 0\) for \(i\neq j\).
The samples are therefore not independent: for i.i.d. Gaussians \(w_i^Tw_j\) has mean \(0\) but nonzero variance, whereas here it is deterministically \(0\).
How do we obtain this orthogonality?
- Create a Gaussian matrix \(G\in \mathbb{R}^{m\times d}\) whose rows are the \(w_i\), with entries taken independently at random from the scalar Gaussian distribution \(N(0,1)\).
- Gram-Schmidt orthogonalize \(G\). After this, the rows of \(G\) are exactly orthogonal and have unit L2-norm.
- Renormalize each row of the resulting matrix by multiplying it by a random variable drawn from the \(\chi(d)\) distribution (the distribution of the norm of a \(d\)-dimensional standard Gaussian vector). The renormalization can use the same draw for every row or an independent draw per row.
- Note: this procedure is feasible only when \(m \leq d\). A code sketch follows the Gram-Schmidt note below.
Gram-Schmidt orthogonalization
The Gram-Schmidt process is a method for orthonormalizing a set of vectors in an inner product space, most commonly the Euclidean space \(\mathbb{R}^n\) equipped with the standard inner product. It takes a finite, linearly independent set \(S = \{v_1, \dots, v_k\}\) for \(k \leq n\) and generates an orthogonal set \(S' = \{u_1, \dots, u_k\}\) that spans the same \(k\)-dimensional subspace of \(\mathbb{R}^n\) as \(S\).
Whenever we move to the next row, we subtract its projections onto the previous rows.
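A minimal NumPy sketch of the ORF construction (the function name is mine; QR decomposition plays the role of the Gram-Schmidt step, with a sign fix on \(R\)'s diagonal so the orthogonal matrix is uniformly distributed):

```python
import numpy as np

def orthogonal_random_features(m, d, rng):
    """m <= d rows that are exactly orthogonal, each marginally ~ N(0, I_d)."""
    assert m <= d, "the construction needs m <= d"
    G = rng.normal(size=(d, d))                    # iid N(0, 1) entries
    Q, R = np.linalg.qr(G)                         # Gram-Schmidt step: orthonormal rows/columns
    Q = Q * np.sign(np.diag(R))                    # sign fix so Q is a uniformly distributed orthogonal matrix
    norms = np.sqrt(rng.chisquare(df=d, size=m))   # row lengths ~ chi(d), like a Gaussian vector's norm
    return norms[:, None] * Q[:m]                  # rescale the first m orthonormal rows

rng = np.random.default_rng(0)
W = orthogonal_random_features(8, 16, rng)
print(np.round(W @ W.T, 6))                        # off-diagonal entries are (numerically) zero
```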
Complexity analysis
\(D\): number of random features, \(d\): input dimensionality.
Applying ORF is as expensive as applying RFF; the extra cost is in generating the orthogonal matrix:
- ORF:
  - generating the \(d\times d\) orthogonal matrix: \(O(d^3)\) time and \(O(d^2)\) space
  - computing the transformation: \(O(d^2)\) time and space per input