Transformers
"Attention is all you need"
Attention Block
A sequence of feature vectors \(X = [x_1, \dots, x_L]^T \in \mathbb{R}^{L\times d}\) is fed into an attention block.
\(W_Q, W_K, W_V \in \mathbb{R}^{d\times d}\): learned projection matrices.
\(Q=XW_Q, K=XW_K, V=XW_V \in \mathbb{R}^{L\times d}\)
\(A = e^{\frac{QK^T}{\sqrt{d}}}\in \mathbb{R}^{L\times L}\)
Elementwise exponential.
\(AV \in \mathbb{R}^{L\times d}\)
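A minimal NumPy sketch of the block above (function and variable names are illustrative, not from the lecture). The row renormalization in the second-to-last line is the usual softmax normalization, which these notes return to later; drop it to match the unnormalized \(AV\) written above.

```python
import numpy as np

def attention_block(X, W_Q, W_K, W_V):
    """X: (L, d) sequence of feature vectors; W_Q, W_K, W_V: (d, d) learned projections."""
    L, d = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each (L, d)
    A = np.exp(Q @ K.T / np.sqrt(d))         # (L, L), elementwise exponential of scaled scores
    A = A / A.sum(axis=1, keepdims=True)     # renormalize each row (softmax); see later notes
    return A @ V                             # (L, d)
```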
Residual Connection
Skip connection (residual connection, as in ResNet layers): \(X + f(X)\), where \(f\) here is the attention block.
The advantage of using residual connections can be seen through a dynamical-system view: the input plays the role of the initial condition \(x_0\), and successive layers describe its evolution,
\(\frac{dx}{dt} = F(x, t)\), as in neural ODEs.
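Concretely (a standard observation, stated here for completeness): stacking residual blocks corresponds to a forward-Euler discretization of such an ODE with unit step size,
\[ x^{(k+1)} = x^{(k)} + f\big(x^{(k)}\big) \quad\longleftrightarrow\quad x(t_{k+1}) = x(t_k) + \Delta t \, F\big(x(t_k), t_k\big), \qquad \Delta t = 1. \]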
LayerNorm
After the attention block and skip connection, there is another LayerNorm layer. It normalizes each token: \(X = [x_1, \dots, x_L]^T \to [\hat{x}_1, \dots, \hat{x}_L]^T\) with
\[ \hat{x}_i = \frac{x_i - \frac{1}{d}\sum_{j=1}^{d} x_i[j]}{\mathrm{stddev}(x_i[1], \dots, x_i[d])} \]
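A minimal NumPy sketch of this per-token normalization. The small `eps` is my addition for numerical safety, and the learned scale/shift parameters of a full LayerNorm are omitted.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each row x_i of X (shape (L, d)) over its d features."""
    mean = X.mean(axis=-1, keepdims=True)   # per-token mean over the d entries
    std = X.std(axis=-1, keepdims=True)     # per-token standard deviation
    return (X - mean) / (std + eps)         # eps avoids division by zero
```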
Other forms of norm: batch norm, ...
MLP
After the LayerNorm, we get \(\hat{X}\). It then goes through multiple MLPs (multilayer perceptrons).
Perceptron: a basic computation unit. It takes inputs \(y_1, y_2, y_3, y_4, \dots\), each channel with its own weight \(w_1, w_2, \dots\). The perceptron computes the weighted sum \(\sum_i{w_iy_i}\) and applies a nonlinear mapping \(\sigma: \mathbb{R} \to \mathbb{R}\), producing \(\sigma(\sum_i{w_iy_i})\).
Some examples of \(\sigma\): Sigmoid: \(\sigma(x) = \frac{1}{1+e^{-x}} \in (0,1)\)
\(\tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}\): a rescaled and shifted version of the sigmoid (\(\tanh(x) = 2\sigma(2x) - 1\)), with outputs in \((-1, 1)\)
ReLU (rectified linear unit): \[\mathrm{ReLU}(x) = \begin{cases} x, & x>0 \\ 0, & x\leq 0 \end{cases}\]
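The three activations above in NumPy, for reference (a small sketch; nothing here is specific to the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def tanh(x):
    return np.tanh(x)                 # equals 2*sigmoid(2x) - 1, output in (-1, 1)

def relu(x):
    return np.maximum(x, 0.0)         # x for x > 0, otherwise 0
```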
Multilayer: How to organize perceptrons
Neurons in layer \(i\) broadcast their signals to neurons in layer \(i+1\).
Weights of the connections are learned (through optimization).
Parallelization of the different MLPs, and finally another LayerNorm.
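A minimal sketch of the MLP step, assuming the common transformer choice of one shared two-layer MLP applied position-wise with hidden width \(h\) and ReLU (these specifics are my assumptions, not stated in the notes):

```python
import numpy as np

def mlp_block(X_hat, W1, b1, W2, b2):
    """X_hat: (L, d); W1: (d, h), b1: (h,), W2: (h, d), b2: (d,).

    The same weights act on every row, so all L positions are processed
    in parallel as a single matrix product.
    """
    H = np.maximum(X_hat @ W1 + b1, 0.0)   # weighted sums + ReLU per position
    return H @ W2 + b2                     # project back to model dimension d
```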
Positive random features for softmax and Gaussian kernels
Valerii Likhosherstov
Attention matrix: A
\(A\): rows indexed by queries \(q_i\), columns by keys \(k_j\)
\(A_{ij} = e^{q_ik_j^T}\)
Renormalization: \[\hat{A}_{ij} = \frac{e^{q_ik_j^T}}{\sum_{s=1}^{L}{e^{q_ik_s^T}}}\] Each unnormalized entry is the softmax kernel, which can be written as an expectation over random features: \[ e^{q_ik_j^T} = SM(q_i, k_j) = E[\phi(q_i)^T\phi(k_j)]\] where \[\phi(x)\stackrel{def}{=} e^{\frac{||x||^2}{2}} \frac{1}{\sqrt{m}} \begin{bmatrix}\cos(w_1^Tx)\\ \vdots \\\cos(w_m^Tx)\\\sin(w_1^Tx)\\ \vdots \\\sin(w_m^Tx) \end{bmatrix}\]
\[w_1, \dots, w_m \sim N(0, I_d)\]
Small values matter for the error: because of the renormalization (dividing each entry by the row sum), even small values of \(SM(q_i, k_j)\) must be estimated accurately.
\[MSE(\hat{SM}_m^{trig}(x,y)) = \frac{1}{2m} \exp(||x+y||^2)\, SM^{-2}(x,y) \left(1-\exp(-||x-y||^2)\right)^2\]
\[SM(x,y) = E_{w \sim N(0, I_d)}\left[\exp\left(w^Tx - \frac{||x||^2}{2}\right)\exp\left(w^Ty - \frac{||y||^2}{2}\right)\right]\]
This follows from the Gaussian moment generating function \(E_{w\sim N(0,I_d)}[e^{w^Tz}] = e^{||z||^2/2}\), applied with \(z = x+y\).
\[\phi_m^{+}(x) = e^{-\frac{||x||^2}{2}}\frac{1}{\sqrt{m}}\begin{bmatrix}\exp(w_1^Tx)\\ \vdots \\\exp(w_m^Tx) \end{bmatrix}\]
\[\hat{SM}^{+}_{m}(x,y) = \phi_m^{+}(x)^T\phi_m^{+}(y)\]
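A minimal NumPy sketch of both estimators, following the formulas above (function names and the single-pair interface are my own; in practice the feature maps are applied to whole query/key matrices):

```python
import numpy as np

def phi_trig(x, W):
    """Trigonometric random features; W is (m, d) with rows w_i ~ N(0, I_d)."""
    m = W.shape[0]
    proj = W @ x                                          # (m,) values w_i^T x
    return np.exp(np.dot(x, x) / 2) / np.sqrt(m) * np.concatenate(
        [np.cos(proj), np.sin(proj)])                     # (2m,)

def phi_plus(x, W):
    """Positive random features: exp(w_i^T x) instead of cos/sin, opposite prefactor."""
    m = W.shape[0]
    return np.exp(-np.dot(x, x) / 2) / np.sqrt(m) * np.exp(W @ x)   # (m,)

def sm_hat_trig(x, y, W):
    """Estimate of SM(x, y) = exp(x^T y) via trig features."""
    return np.dot(phi_trig(x, W), phi_trig(y, W))

def sm_hat_plus(x, y, W):
    """Estimate of SM(x, y) = exp(x^T y) via positive features."""
    return np.dot(phi_plus(x, W), phi_plus(y, W))
```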
Key difference between \(\hat{SM}^{trig}_{m}(x,y)\) and \(\hat{SM}^{+}_{m}(x,y)\):
- \(\hat{SM}^{trig}_{m}(x,y)\) becomes arbitrarily accurate as \(x\to y\) (the \((1-\exp(-||x-y||^2))^2\) factor in its MSE vanishes), but the \(SM^{-2}(x,y)\) factor makes its error blow up as \(SM(x,y) \to 0\)
- \(\hat{SM}^{+}_{m}(x,y)\) becomes arbitrarily accurate as \(SM(x,y) \to 0\), which is exactly the regime of small softmax values that matters after renormalization
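A quick numerical check of this difference, continuing the sketch above (the vectors and the choice of \(m\) below are arbitrary illustrative choices):

```python
import numpy as np  # uses sm_hat_trig / sm_hat_plus from the sketch above

rng = np.random.default_rng(0)
d, m = 16, 256
W = rng.standard_normal((m, d))               # rows w_i ~ N(0, I_d)

x = 0.3 * rng.standard_normal(d)
for y in (x + 0.01 * rng.standard_normal(d),  # x close to y: trig estimator does well
          -x):                                # x^T y < 0, SM(x, y) small: positive estimator does well
    print(np.exp(np.dot(x, y)),               # exact SM(x, y)
          sm_hat_trig(x, y, W),
          sm_hat_plus(x, y, W))
```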