General scheme
Approximating the kernel \(K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\)
\(\phi: \mathbb{R}^d \to \mathbb{R}^{lm}\), a randomized feature map
\(K(x,y) = E[\phi(x)^T\phi(y)]\) \[\phi(x) = \frac{h(x)}{\sqrt{m}}\big(f_1(w_1^Tx), ..., f_1(w_m^Tx), f_2(w_1^Tx), ..., f_2(w_m^Tx), ..., f_l(w_1^Tx), ..., f_l(w_m^Tx)\big)^T\] for some functions \(f_1, ..., f_l: \mathbb{R} \to \mathbb{R}\), a renormalizer \(h: \mathbb{R}^d \to \mathbb{R}\), and \(w_1, ..., w_m \overset{iid}{\sim} \Omega\), a distribution on \(\mathbb{R}^d\)
Example:
- Dot-product kernel, \(l = 1\), \(f_1(x)=x\), \(h(x)=1\), \(\Omega = N(0, I_d)\)
- Gaussian kernel, \(l=2\), \(f_1(x)=\cos(x), f_2(x)=\sin(x)\), \(h(x)=1\), \(\Omega = N(0, I_d)\)
- Softmax kernel, \(l=2\), \(f_1(x)=\cos(x), f_2(x)=\sin(x)\), \(\Omega = N(0, I_d)\), renormalizer \(h(x) = e^{\frac{||x||^2}{2}}\)
- Angular kernel, \(l=1, f_1(x)=sgn(x), h(x)=1, \Omega = N(0,I_d)\)
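As an illustration of the scheme above, here is a minimal numpy sketch (not from the source; the helper name `phi` and the chosen `m`, `d` are illustrative) that estimates the Gaussian kernel with the trigonometric features \(f_1=\cos, f_2=\sin\) and then recovers the softmax kernel via the renormalizer \(h(x) = e^{\frac{||x||^2}{2}}\):

```python
import numpy as np

def phi(x, W):
    """Random feature map phi(x) = 1/sqrt(m) * [cos(Wx); sin(Wx)], rows of W ~ N(0, I_d)."""
    m = W.shape[0]
    wx = W @ x
    return np.concatenate([np.cos(wx), np.sin(wx)]) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 10_000
x, y = 0.3 * rng.normal(size=d), 0.3 * rng.normal(size=d)
W = rng.normal(size=(m, d))                        # w_1, ..., w_m ~ iid N(0, I_d), stacked as rows

gauss_exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)   # Gaussian kernel K(x, y)
gauss_approx = phi(x, W) @ phi(y, W)                    # estimate of E[phi(x)^T phi(y)]
print(gauss_exact, gauss_approx)

# Softmax kernel via the renormalizer h(x) = e^{||x||^2 / 2}:
sm_exact = np.exp(x @ y)
sm_approx = np.exp((x @ x + y @ y) / 2) * gauss_approx
print(sm_exact, sm_approx)
```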
Orthogonal Random Features
\(w_1, ..., w_m \sim \Omega\), where \(\Omega\) is isotropic: its pdf is constant on every sphere centered at the origin
ORFs:
\(\to\) each \(w_i \sim \Omega\) marginally
\(\to w_i^Tw_j = 0\) with prob 1 for \(i \neq j\)
- Why do we need ORFs?
- How to construct them?
We use ORFs to reduce the variance of the estimator: each new \(w_i\) provides new information, so we can use fewer features.
Recipe for constructing ORFs (only for isotropic \(\Omega\))
Only when \(m\leq d\)
- Construct \(w_1, ..., w_m \overset{iid}{\sim} \Omega\)
- Construct matrix G:
\[ G = \begin{pmatrix}w_1^T\\...\\w_m^T\end{pmatrix} \in \mathbb{R}^{m\times d}\]
- Conduct Gram-Schmidt orthogonalization of \(G\) to obtain \(G_{ort}\) with L2-normalized rows
- Renormalize each row of \(G_{ort}\) by multiplying it with an independent scalar random variable \(X\) drawn from the distribution of lengths \(||w||\) of vectors \(w \sim \Omega\)
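A minimal sketch of this recipe, assuming \(\Omega = N(0, I_d)\): the Gram-Schmidt step is done with a QR decomposition, and the row lengths are redrawn from the chi distribution with \(d\) degrees of freedom, which is the distribution of \(||w||\) for \(w \sim N(0, I_d)\). The function name is illustrative.

```python
import numpy as np

def orthogonal_random_features(m, d, rng):
    """Rows are marginally N(0, I_d) but pairwise orthogonal (requires m <= d)."""
    assert m <= d, "the recipe only works for m <= d"
    G = rng.normal(size=(m, d))                    # w_1, ..., w_m ~ iid N(0, I_d)
    Q, _ = np.linalg.qr(G.T)                       # Gram-Schmidt on the rows of G
    G_ort = Q.T                                    # m x d, L2-normalized orthogonal rows
    norms = np.sqrt(rng.chisquare(df=d, size=m))   # lengths of N(0, I_d) vectors (chi_d)
    return norms[:, None] * G_ort                  # renormalize each row

rng = np.random.default_rng(0)
W_ort = orthogonal_random_features(m=4, d=16, rng=rng)
print(np.round(W_ort @ W_ort.T, 6))                # diagonal matrix: rows are orthogonal
```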
Proxies
\(G_{ort} \to\) ? (cheaper, structured substitutes for \(G_{ort}\))
- Random-Hadamard matrices \[ G_{ort} = (\frac{1}{\sqrt{d}}HD_1)\cdots(\frac{1}{\sqrt{d}}HD_k) \] for some \(k\in \mathbb{N}_{+}\)
\[ D_i = \mathrm{diag}(\tau_i^1, ..., \tau_i^d), \quad \tau_i^j \overset{iid}{\sim} \mathrm{Unif}\{-1, +1\} \]
H is a Hadamard matrix (to be more precise, H is called a Kronecker-Hadamard matrix) \[ H_0 = [1] \in \mathbb{R}^{1\times 1}, \quad H_{t+1} = \begin{bmatrix} H_t & H_t\\ H_t & -H_t \end{bmatrix} \]
- \(H_t \in \mathbb{R}^{2^t \times 2^t}\) has orthogonal rows/columns.
- \(H_t\) is symmetric
\[ G_{ort} = (\frac{1}{\sqrt{d}}HD_1)\cdots(\frac{1}{\sqrt{d}}HD_k) \in \mathbb{R}^{d \times d} \] H has to be \(\mathbb{R}^{d \times d}\), so \(d\) must be a power of 2; if it is not, we zero-pad our input data.
- Larger \(k\) implies a better approximation of the original ORF mechanism; the factor \(\frac{1}{\sqrt{d}}\) makes each block \(\frac{1}{\sqrt{d}}HD_i\) orthogonal, and we compose \(k\) such blocks (a code sketch follows after this list).
- Givens rotations
\[ G_{ort} = Giv_1 Giv_2...Giv_k, k = O(d\log{d}) \]
A Givens rotation \(Giv(i, j, \theta)\) is the identity matrix except in rows/columns \(i\) and \(j\), where it contains the block \[ \begin{bmatrix} \cos{\theta} & \sin{\theta}\\ -\sin{\theta} & \cos{\theta}\end{bmatrix} \] i.e. the \(i\)th and \(j\)th coordinates are rotated by \(\theta\).
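A minimal sketch of the HD-product proxy above, assuming \(d\) is a power of 2; it materializes \(H\) with scipy for clarity, whereas a practical implementation would apply a fast Walsh-Hadamard transform instead.

```python
import numpy as np
from scipy.linalg import hadamard

def hd_product(d, k, rng):
    """G_ort proxy = (1/sqrt(d) H D_1) ... (1/sqrt(d) H D_k), an orthogonal d x d matrix."""
    H = hadamard(d).astype(float)                  # Kronecker-Hadamard matrix, d = 2^t
    G = np.eye(d)
    for _ in range(k):
        tau = rng.choice([-1.0, 1.0], size=d)      # Rademacher diagonal D_i
        G = G @ (H * tau) / np.sqrt(d)             # multiply by (1/sqrt(d)) H D_i
    return G

rng = np.random.default_rng(0)
G = hd_product(d=8, k=3, rng=rng)
print(np.round(G @ G.T, 6))                        # ~ identity: the proxy is orthogonal
```

And a minimal sketch of the Givens-rotation proxy; the choice \(k = d\log_2 d\) and the uniformly random planes/angles are illustrative assumptions.

```python
import numpy as np

def random_givens_product(d, k, rng):
    """Compose k random Givens rotations; the result is an orthogonal d x d matrix."""
    G = np.eye(d)
    for _ in range(k):
        i, j = rng.choice(d, size=2, replace=False)
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        gi, gj = G[i].copy(), G[j].copy()
        G[i] = c * gi + s * gj                     # rows i and j are rotated by theta,
        G[j] = -s * gi + c * gj                    # all other rows stay untouched
    return G

rng = np.random.default_rng(0)
d = 8
G = random_givens_product(d, k=int(d * np.log2(d)), rng=rng)
print(np.round(G @ G.T, 6))                        # ~ identity
```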
Neural Networks & Attention Mechanism
Our object of interest: sequential data \((x_1, ..., x_L)\), an \(L\)-tuple with each \(x_i \in \mathbb{R}^d\)
Examples
- text data, where the \(x_i\)s correspond to embeddings of words.
- music data, where the \(x_i\)s correspond to embeddings of notes.
- bio-informatics: protein chains, where the \(x_i\)s correspond to embeddings of amino-acids.
Self-Attention
Given \((x_1, ..., x_L)\), we want to understand how the tokens attend to each other.
similarity function: \(K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\)
Attention matrix A
\(A \in \mathbb{R}^{L \times L}\)
\(A_{i,j} = K(x_i, x_j)\)
a.k.a. the kernel matrix of \(K\) on the sequence \((x_1, ..., x_L)\)
Usually a fixed kernel is used: the softmax kernel \(K(x_i, x_j) = SM(x_i, x_j) = e^{x_i^Tx_j}\)
We don't want the values to be exponentially large, so we renormalize: \(e^{x_i^Tx_j} \to e^{\frac{x_i^Tx_j}{\sqrt{d}}}\)
To make the similarity learnable, we insert trainable projection matrices. Thinking of the \(x_i\)s as row vectors, \(K(x_i, x_j) = SM(x_i, x_j) = e^{x_ix_j^T}\). With learnable \(W_Q, W_K \in \mathbb{R}^{d\times d}\): \[K_{W_Q, W_K}(x_i, x_j) = SM(x_iW_Q, x_jW_K) = e^{x_iW_QW_K^Tx_j^T} \]
\(q_i = x_iW_Q\): query-vector corresponding to \(x_i\); \(k_j = x_jW_K\): key-vector corresponding to \(x_j\)
We use \(x_i\) to calculate \(q_i\) and \(k_i\), then see how \(q_i\) attends to \(k_j\) and how \(q_j\) attends to \(k_i\).
\(A_{i,j} = e^{\frac{q_ik_j^T}{\sqrt{d}}}\)
(\(L\)-tuple) -> [attention block] -> another \(L\)-tuple -> [attention block] -> ...
Output of the attention: \(AV\), where \(A \in \mathbb{R}^{L\times L}\) and \(V \in \mathbb{R}^{L\times d}\) with rows \(v_i = x_i W_V\)
Attention Block version I
Input: \(X\)
Parameters (learnable): \(W_Q, W_K, W_V\)
Compute \(Q = XW_Q \in \mathbb{R}^{L\times d}\), and similarly \(K = XW_K\), \(V = XW_V\)
Output: \(AV\), where \(A = e^{\frac{QK^T}{\sqrt{d}}}\) (exponential applied entrywise)
Attention block is parametrized by 3 matrices.
Each \(x_i\) is mapped to \(q_i, k_i, v_i\).
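A minimal numpy sketch of Attention Block version I exactly as specified above (output \(AV\) with entrywise exponential; no masking or row normalization, since the notes do not include them). Shapes and the random initialization are illustrative.

```python
import numpy as np

def attention_block(X, W_Q, W_K, W_V):
    """X: L x d input; W_Q, W_K, W_V: d x d learnable matrices; returns another L x d tuple."""
    L, d = X.shape
    Q = X @ W_Q                                    # queries q_i = x_i W_Q
    K = X @ W_K                                    # keys    k_i = x_i W_K
    V = X @ W_V                                    # values  v_i = x_i W_V
    A = np.exp(Q @ K.T / np.sqrt(d))               # A_{ij} = exp(q_i k_j^T / sqrt(d))
    return A @ V

rng = np.random.default_rng(0)
L, d = 5, 16
X = rng.normal(size=(L, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
print(attention_block(X, W_Q, W_K, W_V).shape)     # (5, 16)
```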