Data-Mining-Lec3

General scheme

Approximating the kernel \(K(x,y) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\)

\(\phi: \mathbb{R}^d \to \mathbb{R}^{lm}\), a randomized feature map (the \(l \times m\) array below, read as one vector)

\(K(x,y) = E[\phi(x)^T\phi(y)]\), where \[\phi(x) = \frac{h(x)}{\sqrt{m}}\begin{bmatrix}f_1(w_1^Tx) & \dots & f_1(w_m^Tx)\\f_2(w_1^Tx) & \dots & f_2(w_m^Tx)\\ \vdots & & \vdots \\f_l(w_1^Tx) & \dots & f_l(w_m^Tx)\end{bmatrix}\] for some renormalizer \(h: \mathbb{R}^d \to \mathbb{R}\) and random vectors \(w_1, \dots, w_m \sim \Omega\)

Examples (a NumPy sketch of the Gaussian case follows the list):

  1. Dot-product kernel, \(l = 1\), \(f_1(x)=x\), \(h(x)=1\), \(\Omega = N(0, I_d)\)
  2. Gaussian kernel, \(l=2\), \(f_1(x)=\cos(x), f_2(x)=\sin(x)\), \(\Omega = N(0, I_d)\)
  3. Softmax kernel, \(l=2\), \(f_1(x)=\cos(x), f_2(x)=\sin(x)\), \(\Omega = N(0, I_d)\), renormalizer \(h(x) = e^{\frac{||x||^2}{2}}\)
  4. Angular kernel, \(l=1\), \(f_1(x)=\mathrm{sgn}(x)\), \(h(x)=1\), \(\Omega = N(0,I_d)\)
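As a concrete instance of the general scheme, here is a minimal NumPy sketch of example 2 (the Gaussian kernel \(K(x,y)=e^{-\frac{\|x-y\|^2}{2}}\)); function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def gaussian_rf(X, W):
    """Random features for the Gaussian kernel K(x, y) = exp(-||x - y||^2 / 2):
    l = 2, f1 = cos, f2 = sin, h(x) = 1, w_i ~ N(0, I_d)."""
    m = W.shape[0]
    P = X @ W.T                                  # (n, m) projections w_i^T x
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, n = 8, 4096, 5
W = rng.standard_normal((m, d))                  # w_1, ..., w_m ~ N(0, I_d)
X = rng.standard_normal((n, d))

Phi = gaussian_rf(X, W)
K_approx = Phi @ Phi.T                           # estimates E[phi(x)^T phi(y)]
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / 2)                  # exact Gaussian kernel
print(np.abs(K_approx - K_exact).max())          # small approximation error
```

The estimator is unbiased for any \(m\); increasing \(m\) only reduces its variance.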

Orthogonal Random Features

\(w_1, ..., w_m \sim \Omega\), where \(\Omega\) is isotropic: its pdf is constant on every sphere centered at the origin, so the direction of each \(w_i\) is uniform.

ORFs:

\(\to\) each \(w_i \sim \Omega\) marginally

\(\to w_i^Tw_j = 0\) with prob 1 for \(i \neq j\)

  1. Why do we need ORFs?
  2. How to construct them?

We use ORFs to reduce the variance of the estimator: since the \(w_i\)s are orthogonal, each new \(w_i\) provides genuinely new information, so we can get the same accuracy with fewer features.

Recipe for constructing ORFs (only for isotropic \(\Omega\))

Only when \(m\leq d\):

  1. Construct \(w_1, ..., w_m \overset{iid}{\sim} \Omega\)
  2. Construct matrix G:

\[ G = \begin{pmatrix}w_1^T\\...\\w_m^T\end{pmatrix} \in \mathbb{R}^{m\times d}\]

  3. Conduct Gram–Schmidt orthogonalization of the rows of \(G\) to obtain \(G_{ort}\) with L2-normalized, pairwise-orthogonal rows.

  4. Renormalize each row of \(G_{ort}\) by multiplying it with an independent scalar random variable \(X\) distributed as the length \(\|w\|_2\) of a vector \(w \sim \Omega\); for \(\Omega = N(0, I_d)\) this is the \(\chi\) distribution with \(d\) degrees of freedom.
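A minimal NumPy sketch of this recipe for \(\Omega = N(0, I_d)\), assuming \(m \leq d\); Gram–Schmidt is carried out via a QR decomposition, and the helper name is illustrative.

```python
import numpy as np

def orthogonal_random_features(d, m, rng):
    """ORFs for isotropic Omega = N(0, I_d), with m <= d.
    Rows are pairwise orthogonal; each row is marginally ~ N(0, I_d)."""
    assert m <= d
    G = rng.standard_normal((m, d))      # step 1: iid rows w_1, ..., w_m ~ N(0, I_d)
    # steps 2-3: Gram-Schmidt (via QR) -> G_ort with orthonormal rows
    Q, _ = np.linalg.qr(G.T)             # Q: (d, m) with orthonormal columns
    G_ort = Q.T                          # (m, d), L2-normalized orthogonal rows
    # step 4: rescale each row by an independent length drawn from the
    # norm distribution of Omega; for N(0, I_d) that is chi with d dof
    lengths = np.sqrt(rng.chisquare(df=d, size=m))
    return G_ort * lengths[:, None]

rng = np.random.default_rng(0)
W = orthogonal_random_features(d=16, m=8, rng=rng)
print(np.round(W @ W.T, 6))              # off-diagonal entries are 0: rows are orthogonal
```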

Proxies

\(G_{ort} \to\) ? (cheaper structured matrices that behave like \(G_{ort}\))

  1. Random Hadamard matrices \[ G_{ort} = (\frac{1}{\sqrt{d}}HD_1)\cdots(\frac{1}{\sqrt{d}}HD_k) \] for some \(k\in \mathbb{N}_{+}\)

\[ D_i = \mathrm{diag}(\tau_i^1, \dots, \tau_i^d), \qquad \tau_i^j \overset{iid}{\sim} \mathrm{Unif}\{-1, +1\} \]

\(H\) is a Hadamard matrix (to be more precise, a Kronecker–Hadamard matrix, built recursively): \[ H_0 = [1] \in \mathbb{R}^{1\times 1}, \qquad H_{t+1} = \begin{bmatrix} H_t & H_t\\ H_t & -H_t \end{bmatrix} \]

  • \(H_t \in \mathbb{R}^{2^t \times 2^t}\) has orthogonal rows/columns.
  • \(H_t\) is symmetric.
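For example, the first two levels of the recursion are \[ H_1 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad H_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} \]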

\[ G_{ort} = (\frac{1}{\sqrt{d}}HD_1)\cdots(\frac{1}{\sqrt{d}}HD_k) \in \mathbb{R}^{d \times d} \] \(H\) has to be \(d \times d\); what if \(d\) is not a power of 2? Zero-pad the input data.

  • Larger \(k\) gives a better approximation of the original (unstructured) mechanism; the factor \(\frac{1}{\sqrt{d}}\) makes each \(\frac{1}{\sqrt{d}}HD_i\) orthogonal, and we multiply \(k\) such factors.
  2. Givens rotations

\[ G_{ort} = Giv_1 Giv_2...Giv_k, k = O(d\log{d}) \]

Givens rotation: the identity matrix except in rows and columns \(i, j\), where it contains the block \[ \begin{bmatrix} \cos{\theta} & \sin{\theta}\\ -\sin{\theta} & \cos{\theta}\end{bmatrix} \]

i.e., the \(i\)th and \(j\)th coordinates are rotated by \(\theta\) and all others are left unchanged.
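A NumPy/SciPy sketch of the Random Hadamard proxy (item 1 above), assuming \(d\) is a power of 2; the helper name is illustrative, and the matrices are materialized explicitly only to check orthogonality.

```python
import numpy as np
from scipy.linalg import hadamard        # Kronecker/Sylvester Hadamard matrix, d a power of 2

def hd_product(d, k, rng):
    """Structured proxy G_ort = (H D_1 / sqrt(d)) ... (H D_k / sqrt(d)),
    with D_i = diag of iid Rademacher signs; the product is orthogonal."""
    H = hadamard(d) / np.sqrt(d)         # (H / sqrt(d)) (H / sqrt(d))^T = I
    G = np.eye(d)
    for _ in range(k):
        signs = rng.choice([-1.0, 1.0], size=d)   # tau_i^1, ..., tau_i^d ~ iid {-1, +1}
        G = G @ (H * signs[None, :])              # multiply by (1/sqrt(d)) H D_i
    return G

rng = np.random.default_rng(0)
G_ort = hd_product(d=8, k=3, rng=rng)
print(np.round(G_ort @ G_ort.T, 6))      # identity: rows are orthonormal
```

In practice these matrices are never formed explicitly: multiplying a vector by each \(\frac{1}{\sqrt{d}}HD_i\) costs \(O(d\log d)\) via the fast Walsh–Hadamard transform, which is what makes the proxy cheaper than explicit Gram–Schmidt.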

Neural Networks & Attention Mechanism

Our object of interest: sequential data \((x_1, ..., x_L)\), an \(L\)-tuple with each \(x_i \in \mathbb{R}^d\).

Examples

  1. text data, where \(x_i\)s correspond to embeddings of words.

  2. music data, where \(x_i\)s correspond to embeddings of notes.

  3. bio-informatics data, e.g. protein chains, where \(x_i\)s correspond to embeddings of amino-acids.

Self-Attention

Given \((x_1, ..., x_L)\), we want to understand how the tokens attend to each other.

similarity function \(K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\), evaluated on pairs \((x_i, x_j)\)

Attention matrix A

\(A \in \mathbb{R}^{L \times L}\)

\(A_{i,j} = K(x_i, x_j)\)

a.k.a. the kernel matrix corresponding to the kernel \(K\) and the sequence \((x_1, ..., x_L)\)

Usually a fixed kernel is used, the softmax kernel: \(K(x_i, x_j) = SM(x_i, x_j) = e^{x_i^Tx_j}\).

We don't want the entries to be exponentially large, so we renormalize: \(e^{x_i^Tx_j} \to e^{\frac{x_i^Tx_j}{\sqrt{d}}}\).

We can also make the kernel learnable: \(K(x_i, x_j) = e^{(W_1x_i)^T(W_2x_j)}\) for trainable matrices \(W_1, W_2\).

Thinking about the \(x_i\)s as row vectors, the softmax kernel becomes \(K(x_i, x_j) = SM(x_i, x_j) = e^{x_ix_j^T}\), and with learnable projections \(W_Q, W_K\):

\[K_{W_Q, W_K}(x_i, x_j) = SM(x_iW_Q, x_jW_K) = e^{x_iW_QW_K^Tx_j^T}\]

\(q_i = x_iW_Q\), the query vector corresponding to \(x_i\); \(k_j = x_jW_K\), the key vector corresponding to \(x_j\).

We use \(x_i\) to compute \(q_i\) and \(k_i\), then look at how \(q_i\) attends to \(k_j\) and how \(q_j\) attends to \(k_i\).

\(A_{i,j} = e^{\frac{q_ik_j^T}{\sqrt{d}}}\)

(L-tuple) \(\to\) [attention block] \(\to\) another L-tuple \(\to\) [attention block] \(\to\) ...

Output of the attention block: \(AV\), where \(A \in \mathbb{R}^{L\times L}\) and \(V \in \mathbb{R}^{L\times d}\) with rows \(v_i = x_i W_V\).

Attention Block version I

Input: \(X\)

Learnable parameters: \(W_Q, W_K, W_V\)

Compute \(Q = XW_Q \in \mathbb{R}^{L\times d}\), and similarly \(K = XW_K\), \(V = XW_V\)

Output: \(AV\), where \(A = e^{\frac{QK^T}{\sqrt{d}}}\) (exponentiation applied entrywise)

Attention block is parametrized by 3 matrices.

Each \(x_i\) is mapped to \(q_i, k_i, v_i\).
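A minimal NumPy sketch of Attention Block version I exactly as written above (note there is no row-normalization of \(A\) in this version); the function name is illustrative.

```python
import numpy as np

def attention_block(X, W_Q, W_K, W_V):
    """Attention Block version I: A = exp(Q K^T / sqrt(d)) entrywise, output = A V."""
    L, d = X.shape
    Q = X @ W_Q                          # (L, d) queries, q_i = x_i W_Q
    K = X @ W_K                          # (L, d) keys,    k_i = x_i W_K
    V = X @ W_V                          # (L, d) values,  v_i = x_i W_V
    A = np.exp(Q @ K.T / np.sqrt(d))     # (L, L) attention matrix, A_ij = e^{q_i k_j^T / sqrt(d)}
    return A @ V                         # another L-tuple, ready for the next attention block

rng = np.random.default_rng(0)
L, d = 6, 4
X = rng.standard_normal((L, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
print(attention_block(X, W_Q, W_K, W_V).shape)   # (6, 4)
```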