
Data Mining Midterm

Hua Yao (UNI: hy2632)

Problem 1: Anisotropic Gaussian Kernels (50 points)

Given: \[K(x,y) = (2\pi)^{-\frac{d}{2}}(\det(\Sigma))^{-\frac{1}{2}}\exp(-\frac{1}{2}(x-y)^{\top}\Sigma^{-1}(x-y)), \] \(\Sigma \in \mathbb{R}^{d\times d}\) is positive definite symmetric.

Show that K does not need to be an RBF kernel (10 points)

The radial basis function (RBF) kernel is defined as \[K(x,y) = \exp\big( -\frac{||x-y||^2}{2\sigma^2} \big),\] while the given kernel is \[K(x,y) = \big[(2\pi)^{-\frac{d}{2}}(\det(\Sigma))^{-\frac{1}{2}}\big] \cdot \big[\exp(-\frac{1}{2}(x-y)^{\top}\Sigma^{-1}(x-y))\big]\]

Therefore, \(K\) is not necessarily an RBF kernel, for two reasons:

  • The constant coefficient \((2\pi)^{-\frac{d}{2}}(\det(\Sigma))^{-\frac{1}{2}}\) is not necessarily \(1\)
  • \(\exp(-\frac{1}{2}(x-y)^{\top}\Sigma^{-1}(x-y))\) does not necessarily have the form of \(\exp\big( -\frac{||x-y||^2_2}{2\sigma^2}\big)\)

Give necessary and sufficient conditions for K to be an RBF function (10 points)

Condition 1: \((2\pi)^{-\frac{d}{2}}(\det(\Sigma))^{-\frac{1}{2}} = 1\)

\[\det(\Sigma) = (\frac{1}{2\pi})^d\]

Also, since \(\Sigma\) is positive definite symmetric, all of its eigenvalues are positive, so \(\det(\Sigma) > 0\) and \(\Sigma\) is invertible, \[\therefore \det(\Sigma^{-1}) = \frac{1}{\det(\Sigma)} = (2\pi)^d \]

Condition 2: \(\exp(-\frac{1}{2}(x-y)^{\top}\Sigma^{-1}(x-y)) = \exp\big( -\frac{||x-y||^2}{2\sigma^2}\big)\)

\[(x-y)^{\top}\Sigma^{-1}(x-y) = \frac{||x-y||^2}{\sigma^2}\]

Let \(z = x-y \in \mathbb{R}^{d}\),

\[z^{\top}\Sigma^{-1}z = \frac{||z||^2}{\sigma^2} = \frac{z^{\top}z}{\sigma^2} = z^{\top}\frac{I}{\sigma^2}z\]

\[z^{\top}(\Sigma^{-1}-\frac{I}{\sigma^2})z = 0 \quad \forall z \in \mathbb{R}^{d}\]

\(\because \Sigma^{-1} - \frac{I}{\sigma^2}\) is real symmetric, and a real symmetric matrix whose quadratic form vanishes for every \(z\) must be the zero matrix (all of its eigenvalues are \(0\)),

\[\therefore \Sigma^{-1} - \frac{I}{\sigma^2} = 0\]

\[\therefore \Sigma^{-1} = \frac{I}{\sigma^2}, \Sigma = \sigma^2 I\]

Combine results from the previous two conditions

From Condition 2, \[\det({\Sigma}) = \prod_{i=1}^d{\lambda_i} = (\sigma^2)^d\]

From Condition 1, \[\det(\Sigma) = (\frac{1}{2\pi})^d\]

\[\therefore \sigma^2 = \frac{1}{2\pi}\]

Conclusion

To sum up, the necessary and sufficient condition for \(K\) to be an RBF kernel is \[\Sigma = \frac{1}{2\pi}I_d,\]

or equivalently \[\Sigma^{-1} = 2\pi I_d\]

Show that K is a PSD function and provide a random feature map \(\phi:\mathbb{R}^{d}\to\mathbb{R}^{m}\) for K (25 points)

Random Feature Map for Multivariate Gaussian

\[K(x,y) = (2\pi)^{-\frac{d}{2}}(\det(\Sigma))^{-\frac{1}{2}}\exp(-\frac{1}{2}(x-y)^{\top}\Sigma^{-1}(x-y)) \]

\[X = [x_1, ..., x_N] \in \mathbb{R}^{d\times N}\]

For the 1-d Gaussian distribution, the pdf is

\[f(x) = \frac{1}{\sqrt{2 π}σ}e^{\frac{-(x - μ)^2}{2 σ^2}}\]

And we have trigonometric random feature maps because

\[E_w\big[\cos\big(\frac{w^T(x-y)}{\sigma}\big)\big] = e^{-\frac{||x - y||^2}{2 \sigma^2}}\]

Similarly, for multivariate Gaussian distribution,

\[E_w[cos(w^T\Sigma^{-\frac12}(x-y))] = \exp(-\frac{1}{2}(x-y)^{\top}\Sigma^{-1}(x-y))\]

Here we propose the random feature map \[ \phi(x) = \frac{\sqrt{(2\pi)^{-\frac{d}{2}}(\det(\Sigma))^{-\frac{1}{2}}}}{\sqrt{m}} \begin{bmatrix} \cos(w_1^T\Sigma^{-\frac12}x)\\...\\ \cos(w_m^T\Sigma^{-\frac12}x)\\ \sin(w_1^T\Sigma^{-\frac12}x)\\...\\ \sin(w_m^T\Sigma^{-\frac12}x) \end{bmatrix} \]

The \(\Sigma^{-\frac12}\) is derived from eigenvalue diagonalization:

  • Since \(\Sigma\) is positive definite, the diagonalized \(\Lambda\) has positive eigenvalues on its diagonal: \[\Sigma = U\Lambda U^T = (U\Lambda^{\frac12})(U\Lambda^{\frac12})^T = AA^T, \quad A = U\Lambda^{\frac12}\]

  • Then for \(\Sigma^{-1}\), \[\Sigma^{-1} = (AA^T)^{-1}=(A^{-1})^TA^{-1}\] Here \(A^{-1}\) plays the role of \(\Sigma^{-\frac12}\).
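To make the construction concrete, here is a minimal NumPy sketch of this feature map. The function names, toy sizes, and seed are illustrative, and the sketch uses the symmetric square root \(U\Lambda^{-\frac12}U^T\) in place of \(A^{-1}\); both choices satisfy \((\Sigma^{-\frac12})^{\top}\Sigma^{-\frac12} = \Sigma^{-1}\).

```python
import numpy as np

def sample_projection(Sigma, m, seed=0):
    """Draw w_1, ..., w_m ~ N(0, I_d) and pre-multiply by Sigma^{-1/2}.
    The same projection matrix must be reused for every input."""
    d = Sigma.shape[0]
    lam, U = np.linalg.eigh(Sigma)                    # Sigma = U diag(lam) U^T, lam > 0
    Sigma_inv_half = U @ np.diag(lam ** -0.5) @ U.T   # symmetric square root of Sigma^{-1}
    W = np.random.default_rng(seed).standard_normal((m, d))
    return W @ Sigma_inv_half                         # rows are w_i^T Sigma^{-1/2}

def phi(x, P, Sigma):
    """Trigonometric feature map from the text: phi(x) has 2m entries and
    phi(x) . phi(y) is a Monte Carlo estimate of K(x, y)."""
    d = Sigma.shape[0]
    m = P.shape[0]
    const = (2 * np.pi) ** (-d / 2) / np.sqrt(np.linalg.det(Sigma))
    proj = P @ x                                      # w_i^T Sigma^{-1/2} x
    return np.sqrt(const / m) * np.concatenate([np.cos(proj), np.sin(proj)])

# Quick numerical check against the exact kernel (random SPD Sigma).
d, m = 3, 20000
rng = np.random.default_rng(1)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)
x, y = rng.standard_normal(d), rng.standard_normal(d)
P = sample_projection(Sigma, m)
estimate = phi(x, P, Sigma) @ phi(y, P, Sigma)
exact = ((2 * np.pi) ** (-d / 2) / np.sqrt(np.linalg.det(Sigma))
         * np.exp(-0.5 * (x - y) @ np.linalg.solve(Sigma, x - y)))
print(estimate, exact)   # the two values agree up to Monte Carlo error
```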

Proof of PSD

From above we know the random feature map for this kernel. \[K(x,y) = (2\pi)^{-\frac{d}{2}}(\det(\Sigma))^{-\frac{1}{2}}\exp(-\frac{1}{2}(x-y)^{\top}\Sigma^{-1}(x-y)) \]

\[X = \{x_1, ..., x_N\} \subseteq \mathbb{R}^d \]

\[\kappa(X)_{ij} =K(x_i,x_j)= E[\phi(x_i)\phi(x_j)^T]\]

For any row vector \(v \in \mathbb{R}^{1\times N}\), \[\begin{aligned} v\kappa(X)v^T &= \sum_{ij}E[\phi(x_i)\phi(x_j)^T]v_iv_j\\ &= \sum_{ij}E[\phi(x_i)\phi(x_j)^Tv_iv_j]\\ &= E[\sum_{ij}\phi(x_i)\phi(x_j)^Tv_iv_j]\\ \end{aligned}\]

\[\Phi(X) = \begin{bmatrix} \phi(x_1)\\ ...\\ \phi(x_N)\\ \end{bmatrix} \in \mathbb{R}^{N\times m} \]

\[\Phi(X)\Phi(X)^T \in \mathbb{R}^{N\times N}, \quad [\Phi(X)\Phi(X)^T]_{ij} = \phi(x_i)\phi(x_j)^T\]

\[\begin{aligned} \therefore v\kappa(X)v^T &= E[v\Phi(X)\Phi(X)^Tv^T]\\ &= E[v\Phi(X)(v\Phi(X))^T] \geq 0 \end{aligned}\]
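As a numerical illustration of this PSD property, the following small sketch (with an assumed random positive definite \(\Sigma\) and random points) forms the exact Gram matrix \(\kappa(X)\) and verifies that its smallest eigenvalue is non-negative.

```python
import numpy as np

# Assumed toy setup: N random points in R^d and a random SPD Sigma.
d, N = 3, 50
rng = np.random.default_rng(0)
B = rng.standard_normal((d, d))
Sigma = B @ B.T + np.eye(d)                                  # positive definite symmetric
X = rng.standard_normal((N, d))                              # rows are x_1, ..., x_N
const = (2 * np.pi) ** (-d / 2) / np.sqrt(np.linalg.det(Sigma))
diffs = X[:, None, :] - X[None, :, :]                        # pairwise differences x_i - x_j
quad = np.einsum('ijk,kl,ijl->ij', diffs, np.linalg.inv(Sigma), diffs)
Gram = const * np.exp(-0.5 * quad)                           # kappa(X)_{ij} = K(x_i, x_j)
print(np.linalg.eigvalsh(Gram).min())                        # >= 0 up to round-off
```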

What is the time complexity of computing \(\phi\)? (5 points)

From above, \[ \phi(x) = \frac{\sqrt{(2\pi)^{-\frac{d}{2}}(\det(\Sigma))^{-\frac{1}{2}}}}{\sqrt{m}} \begin{bmatrix} \cos(w_1^T\Sigma^{-\frac12}x)\\...\\ \cos(w_m^T\Sigma^{-\frac12}x)\\ \sin(w_1^T\Sigma^{-\frac12}x)\\...\\ \sin(w_m^T\Sigma^{-\frac12}x) \end{bmatrix} \]

  • First, we diagonalize \(\Sigma\) to get \(\Sigma^{-\frac12}\): \(\Sigma = U\Lambda U^T = (U\Lambda^{\frac12})(U\Lambda^{\frac12})^T = AA^T\). This decomposition takes \(O(d^3)\).

  • Then we compute each entry of \(\phi(x)\): each \(\cos(w_i^T\Sigma^{-\frac12}x)\) takes \(O(d^2)\), so all \(m\) entries take \(O(md^2)\).

  • The overall time complexity is the larger of \(O(d^3)\) and \(O(md^2)\). Usually \(m>d\), so the final result is \(O(md^2)\).

Problem 2: Composite Attention (50 points)

Algorithm 1. Online Linear-Time Attention

For example, we can attend only to the nearest neighboring tokens, i.e., compute only \(A_{i,i\pm 1}\); or attend to a few tokens but learn which ones to attend to.

Algorithm 2. Efficient Dense Attention

Construct random feature map for this composite kernel

Notice that this kernel is the product of the softmax kernel and the angular kernel.

\[\begin{aligned} A_{ij} &= K_{SM}(d^{-\frac{1}{4}}q_i, d^{-\frac{1}{4}}k_j)K_{Ang}(q_i, k_j)\\ &=E[\phi_{SM}(\bar{q_i})\phi_{SM}(\bar{k_j})^T] E[\phi_{Ang}(q_i)\phi_{Ang}(k_j)^T]\\ &\approx E[\phi_{SM}(\bar{q_i})\phi_{SM}(\bar{k_j})^T * \phi_{Ang}(q_i)\phi_{Ang}(k_j)^T] \end{aligned}\]

Here we denote \(d^{-\frac{1}{4}}x\) as \(\bar{x}\).

Use positive random features for \(\phi_{SM}\): \[\phi_{SM}(x) = \frac{1}{\sqrt m}e^{-\frac{||x||^2}{2}}[e^{w_1^Tx},...,e^{w_m^Tx}]\]

\[\phi_{Ang}(x) = \frac{1}{\sqrt m}[sgn(w_1^Tx),...,sgn(w_m^Tx)] \]

\[\begin{aligned} E_w[\phi_{SM}(\bar{x})\phi_{SM}(\bar{y})^T*\phi_{Ang}(x)\phi_{Ang}(y)^T] &= E_w[\frac{1}{m^2}e^{-\frac{||\bar{x}||^2}{2}}e^{-\frac{||\bar{y}||^2}{2}}\sum_{i=1}^{m}{e^{w_i^T\bar{x}}e^{w_i^T\bar{y}}} \sum_{i=1}^{m}{sgn(w_i^Tx)sgn(w_i^Ty)}]\\ &\approx E_w[\frac{1}{m^2}e^{-\frac{||\bar{x}||^2}{2}}e^{-\frac{||\bar{y}||^2}{2}}\sum_{i=1}^{m}{e^{w_i^T\bar{x}}e^{w_i^T\bar{y}}}\cdot m \cdot E_w[sgn(w^Tx)sgn(w^Ty)]]\\ & \scriptstyle{\text{replace the expectation by a one-point estimate with an independent sample } w_{m+1}}\\ &\approx E_w[\frac{1}{m}e^{-\frac{||\bar{x}||^2}{2}}e^{-\frac{||\bar{y}||^2}{2}}\sum_{i=1}^{m}{e^{w_i^T\bar{x}}e^{w_i^T\bar{y}}}{sgn(w_{m+1}^Tx)sgn(w_{m+1}^Ty)}] \end{aligned}\]

The newly proposed random feature map is \[\phi_{comp}(x) = \frac{1}{\sqrt{m}}e^{-\frac{||\bar{x}||^2}{2}}\begin{bmatrix} e^{w_1^T\bar{x}}sgn(w_{m+1}^Tx)\\ ...\\ e^{w_m^T\bar{x}}sgn(w_{m+1}^Tx) \end{bmatrix}, \bar{x} = d^{-\frac{1}{4}}x \]

Each time we generate an extra \(w_{m+1}\in \mathbb{R}^{d}\) to compute the \(sgn\) term and apply it to all entries of the feature map.
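A minimal NumPy sketch of \(\phi_{comp}\) under these assumptions (the names and toy sizes are illustrative; `W` stacks \(w_1,\dots,w_m\) as rows and `w_extra` plays the role of \(w_{m+1}\)):

```python
import numpy as np

def phi_comp(x, W, w_extra):
    """Composite random feature from the text: positive softmax-kernel features on
    x_bar = d^{-1/4} x, each multiplied by a shared sign factor sgn(w_{m+1}^T x)."""
    d = x.shape[0]
    m = W.shape[0]
    x_bar = x / d ** 0.25
    sm_part = np.exp(W @ x_bar)                    # e^{w_i^T x_bar}, i = 1, ..., m
    sign_part = np.sign(w_extra @ x)               # sgn(w_{m+1}^T x), shared by all entries
    return np.exp(-np.dot(x_bar, x_bar) / 2) / np.sqrt(m) * sm_part * sign_part

# W (m x d) and w_extra (d,) are sampled once and shared by all queries and keys,
# so that phi_comp(q) . phi_comp(k) estimates K_SM(q_bar, k_bar) * K_Ang(q, k).
d, m = 16, 256
rng = np.random.default_rng(0)
W, w_extra = rng.standard_normal((m, d)), rng.standard_normal(d)
q, k = rng.standard_normal(d), rng.standard_normal(d)
print(phi_comp(q, W, w_extra) @ phi_comp(k, W, w_extra))
```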

Apply the Dense Attention mechanism

Since we have already found a proper random feature map for this composite kernel, we can apply the "efficient dense attention" algorithm.

\[\begin{aligned} A_{ij} &= K_{SM}(d^{-\frac{1}{4}}q_i, d^{-\frac{1}{4}}k_j)K_{Ang}(q_i, k_j)\\ &= K_{comp}(q_i, k_j)\\ &\approx E[\phi_{comp}(q_i)^T\phi_{comp}(k_j)] \end{aligned}\]

Then in matrix form, \[A\approx E[Q'(K')^T]\]

\[Q'=[\phi_{comp}(q_1), ..., \phi_{comp}(q_L)]^T\]

\[K'=[\phi_{comp}(k_1), ..., \phi_{comp}(k_L)]^T\]

\[AV \approx E[Q'(K')^T]V = E[Q'(K')^TV]\approx Q'(K')^TV = Q'((K')^TV)\]

\[Q',K'\in \mathbb{R}^{L\times m}, V\in \mathbb{R}^{L\times d}\]

We take \(O(mLd)\) time to compute \((K')^TV\in \mathbb{R}^{m\times d}\) and another \(O(mLd)\) to compute \(Q'((K')^TV)\), so the whole attention output costs \(O(mLd)\), which is linear in the sequence length \(L\).
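A short sketch of this ordering with assumed toy shapes (random matrices stand in for the actual feature maps; the row normalization of attention weights is omitted here, as in the derivation above):

```python
import numpy as np

# Rows of Q_prime / K_prime would hold phi_comp(q_i) / phi_comp(k_j); V holds the values.
L, m, d = 1024, 64, 32
rng = np.random.default_rng(0)
Q_prime = rng.standard_normal((L, m))
K_prime = rng.standard_normal((L, m))
V = rng.standard_normal((L, d))

KV = K_prime.T @ V            # (m x d), costs O(mLd)
out = Q_prime @ KV            # (L x d), costs O(mLd); the L x L matrix A is never formed
```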


Problem 3: MLPs for attention (30 points)

\[E[Q'(K')^T] = A = \exp(\frac{QK^T}{\sqrt{d}})\]

\[Q\in \mathbb{R}^{L\times d}, Q'\in \mathbb{R}^{L\times m}\]

\[A_{ij} = \exp(\frac{q_ik_j^T}{\sqrt{d}})\]

\[\begin{aligned} \therefore E[q_i'k_j'^T] & =\exp(\frac{q_ik_j^T}{\sqrt{d}})\\ & =K_{SM}(d^{-\frac{1}{4}}q_i, d^{-\frac{1}{4}}k_j)\\ & =E[\phi(d^{-\frac{1}{4}}q_i)\phi(d^{-\frac{1}{4}}k_j)^T] \end{aligned}\]

One valid pair of \(Q'\) and \(K'\) satisfies:

\[q_i' = \phi(d^{-\frac{1}{4}}q_i)\]

\[k_j' = \phi(d^{-\frac{1}{4}}k_j)\]

\[q_i', k_j' \in \mathbb{R}^{m}\]

Here we use positive random features \(\phi_m^+:\mathbb{R}^d \to \mathbb{R}^m\) to guarantee that the dimensionality of the output is \(m\).

\[\phi_m^{+}(x) = e^{-\frac{||x||^2}{2}}\frac{1}{\sqrt{m}}\begin{bmatrix}\exp(w_1^Tx)\\...\\ \exp(w_m^Tx) \end{bmatrix}\]

\[w_1,...w_m \sim N(0,I_d)\]

\[q_i' = \phi_m^{+}(d^{-\frac{1}{4}}q_i)\]

\[k_j' = \phi_m^{+}(d^{-\frac{1}{4}}k_j)\]

Map all queries \(q_1, ..., q_L\) and keys \(k_1, ..., k_L\) to generate the rows of \(Q'\) and \(K'\), as in the sketch below.
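A minimal sketch of this mapping, assuming toy shapes and that the same draw of \(w_1,\dots,w_m\) is shared by queries and keys (the name `positive_features` and the seed are illustrative):

```python
import numpy as np

def positive_features(X_scaled, W):
    """Apply phi_m^+ row-wise: X_scaled holds d^{-1/4} q_i (or d^{-1/4} k_j) as rows,
    W has rows w_1, ..., w_m ~ N(0, I_d). Returns the L x m matrix of features."""
    m = W.shape[0]
    norms = np.sum(X_scaled ** 2, axis=1, keepdims=True)      # ||x||^2 per row
    return np.exp(X_scaled @ W.T - norms / 2) / np.sqrt(m)    # e^{-||x||^2/2} e^{w_i^T x} / sqrt(m)

# The same W must be shared by queries and keys so that E[Q' K'^T] = exp(Q K^T / sqrt(d)).
L, d, m = 128, 16, 256
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((L, d)), rng.standard_normal((L, d))
W = rng.standard_normal((m, d))
Q_prime = positive_features(Q / d ** 0.25, W)     # rows are phi_m^+(d^{-1/4} q_i)
K_prime = positive_features(K / d ** 0.25, W)     # rows are phi_m^+(d^{-1/4} k_j)
```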


Problem 4: Gaussian kernel via random feature maps (40 points)

Take hyperparameter \(\sigma^2=1\),

\[K_{Gauss}(x,y) = \exp(-\frac{||x-y||^2}{2})\]

Trigonometric feature map

\[K_{Gauss}(x,y) = E[\phi(x)\phi(y)^T]\]

\[\phi(x)\stackrel{def}{=}\frac{1}{\sqrt{m}} \begin{bmatrix}\cos(w_1^Tx)\\...\\ \cos(w_m^Tx)\\ \sin(w_1^Tx)\\...\\ \sin(w_m^Tx) \end{bmatrix}\]

\[w_1, ..., w_m \stackrel{iid}{\sim} N(0,I_d)\]

This feature map yields an unbiased estimate of the kernel.

\[\begin{aligned} \phi(x)\phi(y)^T &= \frac{1}{m}\sum_{i=1}^{m}{\cos(w_i^Tx)\cos(w_i^Ty)+\sin(w_i^Tx)\sin(w_i^Ty)}\\ &= \frac{1}{m}\sum_{i=1}^{m}{\cos(w_i^T(x-y))} \end{aligned} \]

\[\begin{aligned} E[\phi(x)\phi(y)^T] &= E[\frac{1}{m}\sum_{i=1}^{m}{\cos(w_i^T(x-y))}]\\ &=E_w[\cos(w^T(x-y))]\\ &= \Re \int{e^{w^T(x-y)i}dP(w)}\\ &=\exp(-\frac{||x-y||^2}{2}) \end{aligned} \]
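A small numerical check of this unbiasedness claim (toy dimensions, arbitrary seed; `phi` implements the trigonometric map above):

```python
import numpy as np

d, m = 4, 50000
rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)
W = rng.standard_normal((m, d))                       # rows w_i ~ N(0, I_d)
phi = lambda v: np.concatenate([np.cos(W @ v), np.sin(W @ v)]) / np.sqrt(m)
print(phi(x) @ phi(y))                                # Monte Carlo estimate
print(np.exp(-0.5 * np.sum((x - y) ** 2)))            # exact K_Gauss(x, y); should be close
```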

Positive feature map

For the softmax kernel \(K_{SM}(x,y) = e^{x^Ty}\), the positive feature map is \[\phi_{SM}(x) = e^{-\frac{||x||^2}{2}}\frac{1}{\sqrt{m}}\begin{bmatrix}\exp(w_1^Tx)\\...\\ \exp(w_m^Tx) \end{bmatrix}\]

We know that

\[\begin{aligned} K_{Gauss}(x,y) &= \exp(-\frac{||x-y||^2}{2}) \\ &= e^{-\frac{||x||^2}{2}}e^{-\frac{||y||^2}{2}}e^{x^Ty}\\ &= e^{-\frac{||x||^2}{2}}e^{-\frac{||y||^2}{2}}K_{SM}(x,y)\\ \end{aligned} \]

\[\phi_{Gauss}(x) = e^{-\frac{||x||^2}{2}}\phi_{SM}(x)\]

Then for the Gaussian kernel, \[\phi_m^{+}(x) = e^{-||x||^2}\frac{1}{\sqrt{m}}\begin{bmatrix}\exp(w_1^Tx)\\...\\ \exp(w_m^Tx) \end{bmatrix}\]

Now we prove that the positive random feature map for the Gaussian kernel is unbiased.

\[\phi_m^{+}(x)\phi_m^{+}(y)^T = e^{-||x||^2}e^{-||y||^2}\frac{1}{m}\sum_{i=1}^{m}{\exp(w_i^T(x+y))}\]

\[\begin{aligned} E[\phi_m^{+}(x)\phi_m^{+}(y)^T] &= E\big[e^{-||x||^2}e^{-||y||^2}\frac{1}{m}\sum_{i=1}^{m}{\exp(w_i^T(x+y))}\big]\\ &=e^{-||x||^2}e^{-||y||^2}E\big[\frac{1}{m}\sum_{i=1}^{m}{\exp(w_i^T(x+y))}\big]\\ &=e^{-||x||^2}e^{-||y||^2}E_w\big[\exp(w^T(x+y))\big] \end{aligned} \]

\[\begin{aligned}\because E_w\big[\exp(w^T(x+y))\big] &= \int{e^{w^T(x+y)}dP(w)} \qquad (\text{let } z = x+y)\\ &= \int_{\mathbb{R}^d}{e^{w^Tz}\cdot (2\pi)^{-\frac{d}{2}}\cdot e^{-\frac{||w||^2}{2}}dw}\\ &= (2\pi)^{-\frac{d}{2}} e^{\frac{||z||^2}{2}} \int_{\mathbb{R}^d}{e^{-\frac{||z-w||^2}{2}}dw}\\ &= (2\pi)^{-\frac{d}{2}} e^{\frac{||z||^2}{2}} \int_{\mathbb{R}^d}{e^{-\frac{||a||^2}{2}}da} \\ &= (2\pi)^{-\frac{d}{2}} e^{\frac{||z||^2}{2}}\cdot (2\pi)^{\frac{d}{2}} \\ &= e^{\frac{||z||^2}{2}}\\ &= e^{\frac{||x+y||^2}{2}}\\ &= e^{\frac{||x||^2}{2}}e^{\frac{||y||^2}{2}}e^{x^Ty} \end{aligned}\]

\[\begin{aligned} \therefore E[\phi_m^{+}(x)\phi_m^{+}(y)^T] &= e^{-||x||^2}e^{-||y||^2}e^{\frac{||x||^2}{2}}e^{\frac{||y||^2}{2}}e^{x^Ty}\\ &= e^{-\frac{||x||^2}{2}}e^{-\frac{||y||^2}{2}}e^{x^Ty}\\ &= \exp(-\frac{||x-y||^2}{2}) \\ &= K_{Gauss}(x,y) \end{aligned}\]
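The same kind of numerical check for \(\phi_m^{+}\) (toy sizes and an arbitrary seed; the inputs are kept small so that the Monte Carlo variance of the exponential features stays manageable):

```python
import numpy as np

d, m = 4, 200000
rng = np.random.default_rng(1)
x, y = 0.5 * rng.standard_normal(d), 0.5 * rng.standard_normal(d)
W = rng.standard_normal((m, d))                             # rows w_i ~ N(0, I_d)
phi_plus = lambda v: np.exp(W @ v - v @ v) / np.sqrt(m)     # e^{-||v||^2} e^{w_i^T v} / sqrt(m)
print(phi_plus(x) @ phi_plus(y))                            # Monte Carlo estimate
print(np.exp(-0.5 * np.sum((x - y) ** 2)))                  # exact K_Gauss(x, y)
```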