General scheme
Approximating the kernel \(K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\)
\(\phi: \mathbb{R}^d \to \mathbb{R}^{lm}\), a randomized feature map
\(K(x,y) = E[\phi(x)^T\phi(y)]\) \[\phi(x) = \frac{h(x)}{\sqrt{m}}\big(f_1(w_1^Tx), ..., f_1(w_m^Tx), f_2(w_1^Tx), ..., f_2(w_m^Tx), ..., f_l(w_1^Tx), ..., f_l(w_m^Tx)\big)^T\] for some functions \(f_1, ..., f_l: \mathbb{R} \to \mathbb{R}\), a renormalizer \(h: \mathbb{R}^d \to \mathbb{R}\), and \(w_1, ..., w_m \overset{iid}{\sim} \Omega\), a distribution on \(\mathbb{R}^d\)
Example:
- Dot-product kernel, \(l = 1\), \(f_1(x)=x\), \(h(x)=1\), \(\Omega = N(0, I_d)\)
- Gaussian kernel, \(l=2\), \(f_1(x)=\cos(x), f_2(x)=\sin(x)\), \(h(x)=1\), \(\Omega = N(0, I_d)\)
- Softmax kernel, \(l=2\), \(f_1(x)=\cos(x), f_2(x)=\sin(x)\), \(\Omega = N(0, I_d)\), renormalizer \(h(x) = e^{\frac{||x||^2}{2}}\)
- Angular kernel, \(l=1, f_1(x)=sgn(x), h(x)=1, \Omega = N(0,I_d)\)
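As an illustration of the scheme above, here is a minimal numpy sketch (not from the source; the helper name `phi` and the chosen `m`, `d` are illustrative) that estimates the Gaussian kernel with the trigonometric features \(f_1=\cos, f_2=\sin\) and then recovers the softmax kernel via the renormalizer \(h(x) = e^{\frac{||x||^2}{2}}\):

```python
import numpy as np

def phi(x, W):
    """Random feature map phi(x) = 1/sqrt(m) * [cos(Wx); sin(Wx)], rows of W ~ N(0, I_d)."""
    m = W.shape[0]
    wx = W @ x
    return np.concatenate([np.cos(wx), np.sin(wx)]) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 10_000
x, y = 0.3 * rng.normal(size=d), 0.3 * rng.normal(size=d)
W = rng.normal(size=(m, d))                        # w_1, ..., w_m ~ iid N(0, I_d), stacked as rows

gauss_exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)   # Gaussian kernel K(x, y)
gauss_approx = phi(x, W) @ phi(y, W)                    # estimate of E[phi(x)^T phi(y)]
print(gauss_exact, gauss_approx)

# Softmax kernel via the renormalizer h(x) = e^{||x||^2 / 2}:
sm_exact = np.exp(x @ y)
sm_approx = np.exp((x @ x + y @ y) / 2) * gauss_approx
print(sm_exact, sm_approx)
```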
Orthogonal Random Features
\(w_1, ..., w_m \sim \Omega\), where \(\Omega\) is isotropic: its pdf is constant on every sphere centered at the origin
ORFs:
\(\to\) each \(w_i \sim \Omega\) marginally
\(\to w_i^Tw_j = 0\) with prob 1 for \(i \neq j\)
- Why do we need ORFs?
- How to construct them?
We use ORFs to reduce the variance of the estimator: each new \(w_i\) provides new information, so we can use fewer features.
Recipe for constructing ORFs (only for isotropic \(\Omega\))
Only when \(m\leq d\)
- Construct \(w_1, ..., w_m \overset{iid}{\sim} \Omega\)
- Construct matrix G:
\[ G = \begin{pmatrix}w_1^T\\...\\w_m^T\end{pmatrix} \in \mathbb{R}^{m\times d}\]
- Conduct Gram-Schmidt orthogonalization of \(G\) to obtain \(G_{ort}\) with L2-normalized rows
- Renormalize each row of \(G_{ort}\) by multiplying it with an independent scalar random variable \(X\) drawn from the distribution of lengths \(||w||\) of vectors \(w \sim \Omega\)
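A minimal sketch of this recipe, assuming \(\Omega = N(0, I_d)\): the Gram-Schmidt step is done with a QR decomposition, and the row lengths are redrawn from the chi distribution with \(d\) degrees of freedom, which is the distribution of \(||w||\) for \(w \sim N(0, I_d)\). The function name is illustrative.

```python
import numpy as np

def orthogonal_random_features(m, d, rng):
    """Rows are marginally N(0, I_d) but pairwise orthogonal (requires m <= d)."""
    assert m <= d, "the recipe only works for m <= d"
    G = rng.normal(size=(m, d))                    # w_1, ..., w_m ~ iid N(0, I_d)
    Q, _ = np.linalg.qr(G.T)                       # Gram-Schmidt on the rows of G
    G_ort = Q.T                                    # m x d, L2-normalized orthogonal rows
    norms = np.sqrt(rng.chisquare(df=d, size=m))   # lengths of N(0, I_d) vectors (chi_d)
    return norms[:, None] * G_ort                  # renormalize each row

rng = np.random.default_rng(0)
W_ort = orthogonal_random_features(m=4, d=16, rng=rng)
print(np.round(W_ort @ W_ort.T, 6))                # diagonal matrix: rows are orthogonal
```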
Proxies
\(G_{ort} \to\) ? (cheaper, structured substitutes for \(G_{ort}\))
- Random-Hadamard matrices \[ G_{ort} = (\frac{1}{\sqrt{d}}HD_1)\cdots(\frac{1}{\sqrt{d}}HD_k) \] for some \(k\in \mathbb{N}_{+}\)
\[ D_i = \mathrm{diag}(\tau_i^1, ..., \tau_i^d), \quad \tau_i^j \overset{iid}{\sim} \mathrm{Unif}\{-1, +1\} \]
H is a Hadamard matrix (to be more precise, H is called a Kronecker-Hadamard matrix) \[ H_0 = [1] \in \mathbb{R}^{1\times 1}, \quad H_{t+1} = \begin{bmatrix} H_t & H_t\\ H_t & -H_t \end{bmatrix} \]
- \(H_t \in \mathbb{R}^{2^t \times 2^t}\) has orthogonal rows/columns.
- \(H_t\) is symmetric
\[ G_{ort} = (\frac{1}{\sqrt{d}}HD_1)\cdots(\frac{1}{\sqrt{d}}HD_k) \in \mathbb{R}^{d \times d} \] H has to be \(\mathbb{R}^{d \times d}\), so \(d\) must be a power of 2; if it is not, we zero-pad our input data.
- Larger \(k\) implies a better approximation of the original ORF mechanism; the factor \(\frac{1}{\sqrt{d}}\) makes each block \(\frac{1}{\sqrt{d}}HD_i\) orthogonal, and we compose \(k\) such blocks (a code sketch follows after this list).
- Givens rotations
\[ G_{ort} = Giv_1 Giv_2...Giv_k, k = O(d\log{d}) \]
A Givens rotation \(Giv(i, j, \theta)\) is the identity matrix except in rows/columns \(i\) and \(j\), where it contains the block \[ \begin{bmatrix} \cos{\theta} & \sin{\theta}\\ -\sin{\theta} & \cos{\theta}\end{bmatrix} \] i.e. the \(i\)th and \(j\)th coordinates are rotated by \(\theta\).
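A minimal sketch of the HD-product proxy above, assuming \(d\) is a power of 2; it materializes \(H\) with scipy for clarity, whereas a practical implementation would apply a fast Walsh-Hadamard transform instead.

```python
import numpy as np
from scipy.linalg import hadamard

def hd_product(d, k, rng):
    """G_ort proxy = (1/sqrt(d) H D_1) ... (1/sqrt(d) H D_k), an orthogonal d x d matrix."""
    H = hadamard(d).astype(float)                  # Kronecker-Hadamard matrix, d = 2^t
    G = np.eye(d)
    for _ in range(k):
        tau = rng.choice([-1.0, 1.0], size=d)      # Rademacher diagonal D_i
        G = G @ (H * tau) / np.sqrt(d)             # multiply by (1/sqrt(d)) H D_i
    return G

rng = np.random.default_rng(0)
G = hd_product(d=8, k=3, rng=rng)
print(np.round(G @ G.T, 6))                        # ~ identity: the proxy is orthogonal
```

And a minimal sketch of the Givens-rotation proxy; the choice \(k = d\log_2 d\) and the uniformly random planes/angles are illustrative assumptions.

```python
import numpy as np

def random_givens_product(d, k, rng):
    """Compose k random Givens rotations; the result is an orthogonal d x d matrix."""
    G = np.eye(d)
    for _ in range(k):
        i, j = rng.choice(d, size=2, replace=False)
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        gi, gj = G[i].copy(), G[j].copy()
        G[i] = c * gi + s * gj                     # rows i and j are rotated by theta,
        G[j] = -s * gi + c * gj                    # all other rows stay untouched
    return G

rng = np.random.default_rng(0)
d = 8
G = random_givens_product(d, k=int(d * np.log2(d)), rng=rng)
print(np.round(G @ G.T, 6))                        # ~ identity
```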
Neural Networks & Attention Mechanism
Our object of interest: sequential data \((x_1, ..., x_L)\), an \(L\)-tuple with each \(x_i \in \mathbb{R}^d\)
Examples
- text data, where the \(x_i\)s correspond to embeddings of words.
- music data, where the \(x_i\)s correspond to embeddings of notes.
- bio-informatics: protein chains, where the \(x_i\)s correspond to embeddings of amino-acids.
Self-Attention
Given \((x_1, ..., x_L)\), we want to understand how the tokens attend to each other.
similarity function: \(K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\)
Attention matrix A
\(A \in \mathbb{R}^{L \times L}\)
\(A_{i,j} = K(x_i, x_j)\)
a.k.a. the kernel matrix of \(K\) on the sequence \((x_1, ..., x_L)\)
Usually a fixed kernel is used: the softmax kernel \(K(x_i, x_j) = SM(x_i, x_j) = e^{x_i^Tx_j}\)
We don't want the values to be exponentially large, so we renormalize: \(e^{x_i^Tx_j} \to e^{\frac{x_i^Tx_j}{\sqrt{d}}}\)
To make the similarity learnable, we insert trainable projection matrices. Thinking of the \(x_i\)s as row vectors, \(K(x_i, x_j) = SM(x_i, x_j) = e^{x_ix_j^T}\). With learnable \(W_Q, W_K \in \mathbb{R}^{d\times d}\): \[K_{W_Q, W_K}(x_i, x_j) = SM(x_iW_Q, x_jW_K) = e^{x_iW_QW_K^Tx_j^T} \]
\(q_i = x_iW_Q\): query-vector corresponding to \(x_i\); \(k_j = x_jW_K\): key-vector corresponding to \(x_j\)
We use \(x_i\) to calculate \(q_i\) and \(k_i\), then see how \(q_i\) attends to \(k_j\) and how \(q_j\) attends to \(k_i\).
\(A_{i,j} = e^{\frac{q_ik_j^T}{\sqrt{d}}}\)
(\(L\)-tuple) -> [attention block] -> another \(L\)-tuple -> [attention block] -> ...
Output of the attention: \(AV\), where \(A \in \mathbb{R}^{L\times L}\) and \(V \in \mathbb{R}^{L\times d}\) with rows \(v_i = x_i W_V\)
Attention Block version I
Input: \(X\)
Parameters (learnable): \(W_Q, W_K, W_V\)
Compute \(Q = XW_Q \in \mathbb{R}^{L\times d}\), and similarly \(K = XW_K\), \(V = XW_V\)
Output: \(AV\), where \(A = e^{\frac{QK^T}{\sqrt{d}}}\) (exponential applied entrywise)
Attention block is parametrized by 3 matrices.
Each \(x_i\) is mapped to \(q_i, k_i, v_i\).
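A minimal numpy sketch of Attention Block version I exactly as specified above (output \(AV\) with entrywise exponential; no masking or row normalization, since the notes do not include them). Shapes and the random initialization are illustrative.

```python
import numpy as np

def attention_block(X, W_Q, W_K, W_V):
    """X: L x d input; W_Q, W_K, W_V: d x d learnable matrices; returns another L x d tuple."""
    L, d = X.shape
    Q = X @ W_Q                                    # queries q_i = x_i W_Q
    K = X @ W_K                                    # keys    k_i = x_i W_K
    V = X @ W_V                                    # values  v_i = x_i W_V
    A = np.exp(Q @ K.T / np.sqrt(d))               # A_{ij} = exp(q_i k_j^T / sqrt(d))
    return A @ V

rng = np.random.default_rng(0)
L, d = 5, 16
X = rng.normal(size=(L, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
print(attention_block(X, W_Q, W_K, W_V).shape)     # (5, 16)
```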