Data-Mining-Lec5

Transformers

"Attention is all you need"

Attention Block

A sequence of feature vectors \(X = [x_1, \dots, x_L]^T \in \mathbb{R}^{L\times d}\) is fed into an attention block.

\(W_Q, W_K, W_V \in \mathbb{R}^{d\times d}\): learnable projection matrices for queries, keys, and values.

\(Q=XW_Q, K=XW_K, V=XW_V \in \mathbb{R}^{L\times d}\)

\(A = e^{\frac{QK^T}{\sqrt{d}}}\in \mathbb{R}^{L\times L}\)

Elementwise exponential.

\(AV \in \mathbb{R}^{L\times d}\)
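A minimal NumPy sketch of this (un-normalized) attention block, following the shapes above; the sizes L = 6, d = 8 and the random weights are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 8                                   # sequence length, feature dimension

X = rng.normal(size=(L, d))                   # rows are x_1, ..., x_L
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each in R^{L x d}
A = np.exp(Q @ K.T / np.sqrt(d))              # elementwise exponential, R^{L x L}
out = A @ V                                   # attention output, R^{L x d}
print(out.shape)                              # (6, 8)
```

In a full transformer each row of \(A\) is also renormalized to sum to one; that renormalization is exactly what the Performer section below revisits.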

Residual Connection

Skip connection (ResNet-style residual connection): the output is \(X + f(X)\), where \(f\) here is the attention block.

The advantage of using a residual connection: each block only has to learn the update \(f(X)\) on top of the identity, and a stack of such blocks can be read as a discretized dynamical system.

Dynamical-system view: the input plays the role of the initial condition, and each residual block advances the state by one step of the evolution.

\(\frac{dx}{dt} = F(x, t)\): neural ODEs
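To make the ODE connection concrete: iterating the residual update is a forward-Euler step of an ODE with step size \(\Delta t = 1\), which is the viewpoint taken by neural ODEs:

\[ x_{k+1} = x_k + f(x_k) \;\approx\; x(t_k) + \Delta t\, F(x(t_k), t_k), \qquad \frac{dx}{dt} = F(x, t). \]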

LayerNorm

After the attention block and skip connection there is a LayerNorm layer. It normalizes each row of the input: \(X = [x_1, \dots, x_L]^T \to [\hat{x}_1, \dots, \hat{x}_L]^T\), where

\[ \hat{x}_i = \frac{x_i - \frac{1}{d}\sum_{j=1}^{d} x_i[j]}{\mathrm{stddev}(x_i[1], \dots, x_i[d])} \]
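A minimal NumPy sketch of this per-row normalization (real LayerNorm layers also add a learnable scale and shift, omitted here; the `eps` term is a standard numerical-stability assumption):

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each row x_i to zero mean and unit standard deviation."""
    mean = X.mean(axis=-1, keepdims=True)     # (1/d) * sum_j x_i[j]
    std = X.std(axis=-1, keepdims=True)       # stddev(x_i[1], ..., x_i[d])
    return (X - mean) / (std + eps)

X = np.random.default_rng(0).normal(size=(6, 8))
X_hat = layer_norm(X)
print(X_hat.mean(axis=-1).round(6), X_hat.std(axis=-1).round(3))
```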

Other forms of norm: batch norm, ...

MLP

After the LayerNorm we get \(\hat{X}\). It is then fed through MLPs (multilayer perceptrons), one applied to each row in parallel.

Perceptron: a basic computation unit. It takes inputs \(y_1, y_2, y_3, y_4, \dots\), each channel with its own weight \(w_1, w_2, \dots\). The perceptron computes the weighted sum \(\sum_i{w_iy_i}\) and applies a nonlinear mapping \(\sigma: \mathbb{R} \to \mathbb{R}\), giving \(\sigma(\sum_i{w_iy_i})\).

Some examples of \(\sigma\): Sigmoid: \(\sigma(x) = \frac{1}{1+e^{-x}} \in (0,1)\)

\(\tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}\): a rescaled and shifted sigmoid, \(\tanh(x) = 2\sigma(2x) - 1\)

ReLU (rectified linear unit): \[\mathrm{ReLU}(x) = \begin{cases} x, & x>0 \\ 0, & x\leq 0 \end{cases}\]
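The three activations above as NumPy one-liners (a sketch, just to fix the definitions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # values in (0, 1)

def tanh(x):
    return np.tanh(x)                    # rescaled sigmoid: 2*sigmoid(2*x) - 1

def relu(x):
    return np.maximum(x, 0.0)            # x for x > 0, else 0
```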

Multilayer: How to organize perceptrons

  • neurons in layer i broadcast their signals to neurons in layer i+1.

  • weights of connections are learned (through optimization).

  • parallelization of different MLPs across the rows of \(\hat{X}\), and finally a LayerNorm (see the sketch after this list).
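A minimal sketch of that organization, assuming a two-layer position-wise MLP with ReLU; the hidden width `d_hidden = 32` and the random weights stand in for the learned parameters:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp(X_hat, W1, b1, W2, b2):
    """Each hidden neuron computes relu(sum_i w_i * y_i + b); layer 1 feeds layer 2."""
    H = relu(X_hat @ W1 + b1)             # layer 1: one perceptron per column of W1
    return H @ W2 + b2                    # layer 2: weighted sums of layer-1 outputs

rng = np.random.default_rng(0)
L, d, d_hidden = 6, 8, 32
X_hat = rng.normal(size=(L, d))
W1, b1 = rng.normal(size=(d, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d)), np.zeros(d)
print(mlp(X_hat, W1, b1, W2, b2).shape)   # (6, 8): same shape as the input, rows processed in parallel
```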

Positive random features for softmax and Gaussian kernels

Valerii Likhosherstov

Attention matrix: A

A: rows indexed by the queries \(q_i\), columns by the keys \(k_j\)

\(A_{ij} = e^{q_ik_j^T}\)

renormalization: \[\hat{A}_{ij} = \frac{e^{q_ik_j^T}}{\sum_{s=1}^{L}{e^{q_ik_s^T}}}\] \[ e^{q_ik_j^T} = SM(q_i, k_j) = E[\phi(q_i)^T\phi(k_j)]\] where \[\phi(x)\stackrel{def}{=} e^{\frac{||x||^2}{2}} \frac{1}{\sqrt{m}} \begin{bmatrix}\cos(w_1^Tx)\\...\\\cos(w_m^Tx)\\\sin(w_1^Tx)\\...\\\sin(w_m^Tx) \end{bmatrix}\]

\[w_1, \dots, w_m \stackrel{iid}{\sim} N(0, I_d)\]
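A minimal NumPy sketch of the trigonometric random-feature estimate of \(SM(x,y) = e^{x^Ty}\), following the definition of \(\phi\) above; the sizes d = 16, m = 4096 and the scaling of x, y are illustrative choices:

```python
import numpy as np

def phi_trig(x, W):
    """Trigonometric random features; rows of W are w_1, ..., w_m ~ N(0, I_d)."""
    m = W.shape[0]
    wx = W @ x
    return np.exp(x @ x / 2) / np.sqrt(m) * np.concatenate([np.cos(wx), np.sin(wx)])

rng = np.random.default_rng(0)
d, m = 16, 4096
x, y = rng.normal(size=d) / np.sqrt(d), rng.normal(size=d) / np.sqrt(d)
W = rng.normal(size=(m, d))               # w_i ~ N(0, I_d)

est = phi_trig(x, W) @ phi_trig(y, W)     # Monte Carlo estimate of E[phi(x)^T phi(y)]
print(est, np.exp(x @ y))                 # estimate vs exact SM(x, y)
```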

Small values matter for the error: after renormalization most attention weights are close to zero, so the approximation must remain accurate for small values of \(SM(q_i, k_j)\).

\[\mathrm{MSE}(\hat{SM}^{trig}_{m}(x,y)) = \frac{1}{2m} \exp(||x+y||^2)\,SM^{-2}(x,y)\times (1-\exp(-||x-y||^2))^2\]

\[SM(x,y) = E_{\omega \sim N(0, I_d)}\left[\exp\left(\omega^Tx - \frac{||x||^2}{2}\right)\exp\left(\omega^Ty - \frac{||y||^2}{2}\right)\right]\]

\[\phi_m^{+}(x) = e^{-\frac{||x||^2}{2}}\frac{1}{\sqrt{m}}\begin{bmatrix}\exp(w_1^Tx)\\...\\\exp(w_m^Tx) \end{bmatrix}\]

\[\hat{SM}^{+}_{m}(x,y) = \phi_m^{+}(x)^T\phi_m^{+}(y)\]
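And the corresponding sketch for the positive random features \(\phi^{+}_m\) (same illustrative sizes as above); every feature is positive, so the estimate can never go negative even when \(SM(x,y)\) is tiny:

```python
import numpy as np

def phi_plus(x, W):
    """Positive random features: exp(w_i^T x - ||x||^2 / 2), scaled by 1/sqrt(m)."""
    m = W.shape[0]
    return np.exp(W @ x - x @ x / 2) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 16, 4096
x, y = rng.normal(size=d) / np.sqrt(d), rng.normal(size=d) / np.sqrt(d)
W = rng.normal(size=(m, d))               # w_i ~ N(0, I_d)

est = phi_plus(x, W) @ phi_plus(y, W)     # Monte Carlo estimate of phi+(x)^T phi+(y)
print(est, np.exp(x @ y))                 # estimate vs exact SM(x, y)
```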

Key difference between \(\hat{SM}^{trig}_{m}(x,y)\) and \(\hat{SM}^{+}_{m}(x,y)\):

  • \(\hat{SM}^{trig}_{m}(x,y)\) becomes arbitrarily accurate as \(x\to y\)
  • \(\hat{SM}^{+}_{m}(x,y)\) becomes arbitrarily accurate as \(SM(x,y) \to 0\), i.e. precisely in the small-value regime that dominates the renormalized attention weights