CS229 Notes1 Supervised Learning

Part I Linear Regression

Try to train the hypothesis function \(h\).

Letting \(x_0=1\) gives the error term. \[h(x) = \sum_{i=0}^d{\theta_ix_i}=\theta^T x\]

Cost function w.r.t parameter \(\theta\):

\[J(\theta) = \frac12\sum_{i=1}^{n}{(h_\theta(x^{(i)}) - y^{(i)})^2}\]

LMS algorithm

Gradient descent: \[\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta)\]

\(\alpha\) is the learning rate.

Here, \[\frac{\partial}{\partial\theta_j}J(\theta) = (h_\theta(x) - y)x_j\]

For a single training example, the learning rule is LMS(least squared error) a.k.a Widrow-Hoff

\[\theta_j := \theta_j + \alpha (y - h_\theta(x))x_j\]

Into vector level,

\[\theta := \theta + \alpha {(y^{(i)} - h_\theta(x^{(i)}))x^{(i)}}\]

One sample each time, we call this SGD(strochastic gradient descent).

About applying the rule to a training set, - SGD - batch gradient descent

\[\theta := \theta + \alpha \sum_{i=1}^{n}{(y^{(i)} - h_\theta(x^{(i)}))x^{(i)}}\]

SGD vs BGD: Fast / Robust

The normal equations

Matrix derivatives (elementwise)

... ### Least squares revisited (vectorized)

The design matrix \(X\)

\[X = \begin{bmatrix} (x^{(1)})^T\\ (x^{(2)})^T\\...\\(x^{(n)})^T\end{bmatrix} \in \mathbb{R}^{n\times(d+1)}\]

\[y = \begin{bmatrix} y^{(1)}\\ y^{(2)}\\...\\y^{(n)}\end{bmatrix}\]

\[X\theta - y = \begin{bmatrix} h_\theta(x^{(1)})-y^{(1)}\\ h_\theta(x^{(2)})-y^{(2)}\\...\\h_\theta(x^{(n)})-y^{(n)}\end{bmatrix}\]

\[J(\theta) = \frac12(X\theta-y)^T(X\theta-y)\]

\[\nabla_\theta J(\theta) = X^TX\theta - X^Ty\]

To minimize \(J\), we set its derivatives to zero, and obtain normal equations:

\[X^TX\theta = X^Ty\]

Thus, the value of \(\theta\) that minimizes the \(J(\theta)\) is given in closed form by the equation

\[\theta = (X^TX)^{-1}X^Ty\]

Probablistic interpretation

Assume

\[y^{(i)} = \theta^Tx^{(i)} + \epsilon^{(i)} \]

Assume that the error term \(\epsilon \sim N(0, \sigma^2)\)

The density function of $^{(i)} $ is given by

\[p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(\epsilon^{(i)})^2}{2\sigma^2}}\]

Substitute \(\epsilon^{(i)}\) with \(\theta^Tx^{(i)} -y^{(i)}\)

\[p(y^{(i)} | x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(\theta^Tx^{(i)} -y^{(i)})^2}{2\sigma^2}}\]

Note that we are not conditioning on \(\theta\), because \(\theta\) is not a random variable. The distribution of \(y^{(i)}\) can also be written as \(y^{(i)}|x^{(i)}; \theta \sim N(0, \sigma^2)\)

By seeing the conditional distribution as a function of \(\theta\), we define the likelihood function:

\[L(\theta) = \prod_{i=1}^{n}{\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(\theta^Tx^{(i)} -y^{(i)})^2}{2\sigma^2}}}\]

How to best guess the parameter \(\theta\)? Maximum likelihood --- to make the data as high-probability as possible.

Comment: 频率学派(Frequentist), MLE, "Seeing is believing"

Maximize the log likelihood or minimize the negative log likelihood (NLL).

\[\begin{aligned}l(\theta) &= \log{L(\theta)}\\ &= n\log{\frac1{\sqrt{2\pi}\sigma}} - \frac1{\sigma^2}\cdot \frac12\sum_{i=1}^{n}(\theta^Tx^{(i)} -y^{(i)})^2 \end{aligned}\]

Hence it gives the same answer as minimizing the cost function.

Note that here the choice of \(\theta\) does not rely on \(\sigma^2\)

Locally weighted linear regression (LWR)

Instead of fitting \(\theta\) to minimize \(\sum_i{(y^{(i)}-\theta^Tx^{(i)})^2}\), minimize

\[\sum_iw^{(i)}{(y^{(i)}-\theta^Tx^{(i)})^2}\]

The standard choice for the weights:

\[w^{(i)} = \exp\bigg( -\frac{(x^{(i)}-x)^2}{2\tau^2} \bigg)\]

Comment: part of pdf of \(N(x, \tau^2)\), but has nothing to do with gaussian. Just measures the distance, lower weights to distant data points. \(tau\) is called the bandwidth parameter which controls how quickly the weight diminishes with distance increasing.

LWR is non-parametric algorithm. It means the model relies on the training data and scales with bigger size of the training dataset.

Part II Classification and logistic regression

Problem: binary classification Keywords: positive/negative class, label

Logistic regression

To limit the predicted value of \(y\) to \([0,1]\), use sigmoid/logitstic function

\[h_\theta(x) = g(\theta^Tx) = \frac1{1 + e^{-\theta^Tx}}\]

Sigmoid's derivative has such property

\[g'(z) = g(z)(1-g(z))\]

Probablistic assumptions:

\[P(y=1 | x;\theta) = h_\theta(x)\]

\[P(y=0 | x;\theta) = 1-h_\theta(x)\]

Or more compactly written:

\[P(y| x;\theta) = (h_\theta(x))^y(1-h_\theta(x))^{1-y}\]

Then we can build a likelihood function for \(n\) training examples

\[\begin{aligned}L(\theta) &= p(y|X; \theta)\\ &= \prod_{i=1}^{n}{(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}} \end{aligned}\]

\[\begin{aligned}l(\theta) &= \log{L(\theta)}\\ &= \sum_{i=1}^{n}{y^{(i)}\log h(x^{(i)})+({1-y^{(i)}})\log(1-h(x^{(i)}))} \end{aligned}\]

Optimize the LL by Stochastic Gradient Ascent(maximizing):

\[\begin{aligned}\frac{\partial}{\partial \theta_j}l(\theta) &= \bigg(y\frac1{g(\theta^Tx)} + (1-y)\frac1{1-g(\theta^Tx)}\bigg)\frac{\partial}{\partial \theta_j}g(\theta^Tx)\\ &= \bigg(y\frac1{g(\theta^Tx)} + (1-y)\frac1{1-g(\theta^Tx)}\bigg)g(\theta^Tx)(1-g(\theta^Tx))\frac{\partial}{\partial \theta_j}\theta^Tx\\ &= \big(y(1-g(\theta^Tx)) - (1-y)g(\theta^Tx)\big)x_j\\ &= (y-h_\theta(x))x_j \end{aligned}\]

Update rule:

\[\theta_j := \theta_j + \alpha(y - h_\theta(x))x_j\]

This looks identical with LMS update rule but \(h_\theta(x)\) here is different. This is actually general rule.

Digression: perceptron learning algorithm

change \(g(z)\) to \(1\{z\geq 0\}\), still using the previous update rule.

Newton's method for maximizing \(l(\theta)\)

Newton's method for finding a zero of a function. \[\theta := \theta - \frac{f(\theta)}{f'(\theta)}\]

To maximize, we want the derivative \(l'(\theta) = 0\). So we set \(f(\theta) = l'(\theta)\) \[\theta := \theta - \frac{l'(\theta)}{l''(\theta)}\]

Generalize Newton's method to vector-valued/multidimensional setting (a.k.a Newton-Raphson Method):

\[\theta := \theta - H^{-1}\nabla_\theta l(\theta)\]

Complexity: \(O(N^2)\) for inverting \(d\times d\) Hessian. Faster than batch gradient descent.

When maximizing logistic regression, called Fisher scoring.

Part III Generalized Linear Models(GLMs)

In the Classification part, the derivative of sigmoid function is kind of relevant to Bernoulli distribution.

Both of Linear regression and calssification are special cases of a broader family of models in GLM family.

The exponential family

\[p(y; \eta) = b(y)\exp(\eta^TT(y) - a(\eta))\]

\(\eta\): natural/canonical parameter

\(T(y)\): sufficient statistic, often \(y\)

\(a(\eta)\): log partition function. \(\exp{a(\eta)}\) used to normalize so the distributino integrates into 1.

\(T\) and \(b\) dedfines a family of distributions parametrized by \(\eta\)

Bernoulli and Gaussian distributions are examples of exponential family distributions.

\[\begin{aligned} p(y;\phi) &= \phi^y (1-\phi)^{1-y}\\ &= \exp(y\log\phi + (1-y)\log(1-\phi))\\ &= \exp\bigg( \bigg( \log \bigg( \frac{\phi}{1-\phi} \bigg)y + \log(1-\phi) \bigg) \bigg) \end{aligned}\]

Exponential family distribution notations:

\[\begin{aligned} \eta &= \log(\phi/(1-\phi))\\ \phi &= \frac{1}{1+e^{-\eta}}\\ T(y) &= y \\ a(\eta) &= -log(1-\phi)\\ &= \log (1+e^{\eta})\\ b(y) &= 1 \end{aligned}\]

Notice that \(\phi\) is the sigmoid function.

For Gaussian distribution with \(\sigma^2 = 1\) \[\begin{aligned} p(y;\mu) &= \frac1{\sqrt{2\pi}}\exp\bigg(-\frac12(y-\mu)^2 \bigg)\\ &= \frac1{\sqrt{2\pi}}\exp\bigg(-\frac12y^2 + \mu y -\frac12\mu^2 \bigg) \\ &= \frac1{\sqrt{2\pi}}\exp\bigg(-\frac12y^2\bigg) \cdot \exp\bigg(\mu y -\frac12\mu^2\bigg) \end{aligned}\]

Compare with the exponential distribution family form \[p(y; \eta) = b(y)\exp(\eta^TT(y) - a(\eta))\]

We get

\[\begin{aligned} b(y) &= \frac1{\sqrt{2\pi}}\exp\bigg(-\frac12y^2\bigg)\\ \eta &= \mu\\ T(y) &= y \\ a(\eta) &= \mu^2/2\\ \end{aligned}\]

Constructing GLMs

"Poisson distribution usually gives a good model for numbers of visitors". (Counting Model)

Poisson is an exponential family distribution.

To derive a GLM, make the following 3 assumptions, 1. \(y|x; \theta\)~ExponentialFamily\((\eta)\). Given \(x\) and \(\eta\), the distribution of \(y\) follows some exponential family distribution with partameter \(\eta\) 2. Given \(x\), our goal is to predict \(T(y)\). In most cases \(T(y) = y\). We would like \(h_\theta(x) = E[T(y)|x;\theta]\). 3. Natural parameter \(\eta\) is linearly related with the inputs \(x\): \(\eta = \theta^Tx\). This is a "design choice"

OLS revisited using GLM construction

target variable \(y\) is also called reponse variable in GLM terminology.

\[y\sim N(\mu, \sigma^2)\] where \(\mu\) relies on \(x\).

Previously we have GLM for Gaussian, \(\mu = \eta\)

\[\begin{aligned} h_\theta(x) &= E[y|x;\theta]\\ &= \mu \\ &= \eta \\ &= \theta^Tx \end{aligned}\]

First equality from assumption 2. Third equality from assumption 1. Last from assumption 3.

Softmax Regression

Problem set - Multinomial distribution: multiple classification, \(y\in \{1, ..., k\}\)

\(T(y)\) is a \(d-1\) vector with y-th entry 1.

\[T(y)_i = 1\{y = i\}\]

\((T(y))_i\) is the i-th element of \(T(y)\)

\[E[(T(y))_i] = \phi_i\]

The multinomial is a member of the exponential family, where

\[\begin{aligned} b(y) &= 1\\ a(\eta) &= -\log(\phi_k)\\ \eta &= \begin{bmatrix} \log(\phi_1/\phi_k)\\ \log(\phi_2/\phi_k) \\...\\\log(\phi_{k-1}/\phi_k) \end{bmatrix} \end{aligned}\]

The link function is given by \[\eta_i = \log\frac{\phi_i}{\phi_k}\]

Invert the link function

\[e^{\eta_i} = \phi_i/\phi_k\]

\[\phi_ke^{\eta_i} = \phi_i\]

\[\phi_k\sum_{i=1}^k{e^{\eta_i}} = \sum_{i=1}^k{\phi_i}=1\]

\[\therefore \phi_k = \frac1{\sum_{i=1}^k{e^{\eta_i}}}\]

\[\therefore \phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^k{e^{\eta_j}}}\]

This function mapping from \(\eta\) to \(\phi\) is called softmax.

From assumption 3, \(\eta_i = \theta_i^Tx\), and since we define \(\theta_k = 0\), \(\eta_k = 0\)

\[\begin{aligned} p(y=i|x; \theta) &= \phi_i \\ &= \frac{e^{\eta_i}}{\sum_{j=1}^k{e^{\eta_j}}} \\ &= \frac{e^{\theta_i^Tx}}{\sum_{j=1}^k{e^{\theta_j^Tx}}} \end{aligned}\]

This is the softmax regression. It's a generalization of logistic regression (binary classification).

\[\begin{aligned} h_\theta(x) &= E[T(y)|x; \theta] \\ &= [\phi_1 ,.. \phi_{k-1}]^T\\ &= \bigg[\frac{e^{\theta_1^Tx}}{\sum_{j=1}^k{e^{\theta_j^Tx}}}, ..., \frac{e^{\theta_{k-1}^Tx}}{\sum_{j=1}^k{e^{\theta_j^Tx}}}\bigg]^T\\ \end{aligned}\]

Comment: the output is the estimated probability for each class.

For parameter fitting, maximize the log likelihood using gradient ascent or Newton's method.