CS229 Linear Algebra Recap

CS229: Review of Linear Algebra

Operations and properties

Norms

Euclidean or \(l_2\) norm: \(||x||_2 = \sqrt{\sum_{i=1}^{n}{x_i^2}}\)

More formally, a norm is any function \(f: \mathbb{R}^n \to \mathbb{R}\) that satisfies four properties:

  • Non-negativity: \(f(x)\geq 0\)
  • Definiteness: \(f(x)=0\) iff \(x=0\)
  • Homogeneity: \(f(tx) = |t|f(x)\)
  • Triangle inequality: \(f(x+y) \leq f(x) + f(y)\)

Comment: according to Bochner's theorem, a kernel \(K(x,y): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\) should have the property that the Gram matrix \(\kappa(X) = K(X, X)\), formed by evaluating \(K\) on all pairs of data points, is PSD.

\(l_1\) norm: \(||x||_1 = \sum_{i}{|x_i|}\)

\(l_{\infty}\) norm: \(||x||_{\infty} = \max_i{|x_i|}\). This is the limit of the \(l_p\) norm as \(p \to \infty\): the largest coordinate dominates, and the contributions of the other coordinates vanish.

  • General form of \(l_p\) norm

\[||x||_p = \bigg(\sum_{i=1}^{n}{|x_i|^p}\bigg)^{\frac{1}{p}}\]

  • Norm definition for Matrices: Frobenius norm

\[||A||_F = \sqrt{\sum_{i,j}{A_{ij}^2} } = \sqrt{\textrm{tr}(A^TA)}\]
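
A quick numpy sanity check of these norm definitions against numpy's built-ins (a minimal sketch; the array values are arbitrary):

```python
import numpy as np

x = np.array([3.0, -4.0, 12.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

# l2, l1 and l_inf norms from their definitions vs. numpy's built-ins
assert np.isclose(np.sqrt(np.sum(x**2)), np.linalg.norm(x, 2))
assert np.isclose(np.sum(np.abs(x)), np.linalg.norm(x, 1))
assert np.isclose(np.max(np.abs(x)), np.linalg.norm(x, np.inf))

# Frobenius norm: elementwise definition vs. sqrt(tr(A^T A))
assert np.isclose(np.sqrt(np.sum(A**2)), np.sqrt(np.trace(A.T @ A)))
```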

Linear independence and rank

Linearly dependent: one of the vectors \(\{x_1, ..., x_n\}\) can be written as a linear combination of the others, e.g.

\[ x_n = \sum_{i=1}^{n-1}{\alpha_ix_i}\]

Column rank: # of linearly independent columns of A

Row rank is defined similarly and always equals the column rank (proof).

Both are therefore simply called the rank of \(A\).

  • Full rank if rank(\(A\)) \(=\min(m,n)\)
  • rank(\(A\)) = rank(\(A^T\))
  • rank(\(AB\)) \(\leq\) min(rank(\(A\)), rank(\(B\)))
  • rank(\(A+B\)) \(\leq\) rank(\(A\)) + rank(\(B\))

Comment: proof sketch: \[\textrm{rank}(A) = \textrm{dim}(\textrm{span}(a_1, ..., a_n))\] Use this to easily get the 4th property: every column of \(A+B\) lies in \(\textrm{span}(a_1,...,a_n) + \textrm{span}(b_1,...,b_n)\), whose dimension is at most \(\textrm{rank}(A) + \textrm{rank}(B)\).
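
These rank identities can be spot-checked numerically; a minimal sketch with arbitrary random matrices (shapes chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 5))
C = rng.standard_normal((4, 6))

rank = np.linalg.matrix_rank
assert rank(A) == rank(A.T)                    # rank(A) = rank(A^T)
assert rank(A @ B) <= min(rank(A), rank(B))    # rank(AB) <= min(rank(A), rank(B))
assert rank(A + C) <= rank(A) + rank(C)        # rank(A+C) <= rank(A) + rank(C)
```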

Orthogonal Matrices

\[U^TU = I = UU^T\]

Comment: this requires that all column vectors of \(U\) have norm \(1\) and are orthogonal to every other column. \(U^TU = I\) requires \(n \leq m\), because the dimension of the column space cannot exceed the ambient dimension \(m\); symmetrically, \(UU^T = I\) requires \(m \leq n\). Therefore only square matrices can satisfy both.

Property: norm preservation

\[||Ux||_2 = ||x||_2\]

Comment: expand by definition: \(||Ux||_2^2 = x^TU^TUx = x^Tx = ||x||_2^2\); the cross terms vanish because the columns of \(U\) are orthonormal.
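
Numerically, an orthogonal matrix (here built as the Q factor of a QR decomposition, an arbitrary choice) indeed preserves the \(l_2\) norm; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # Q factor of a random square matrix
x = rng.standard_normal(5)

assert np.allclose(U.T @ U, np.eye(5))                       # U^T U = I
assert np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x))  # ||Ux||_2 = ||x||_2
```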

Range and Nullspace of a Matrix

Span (linear span): all linear combinations of the columns

Projection of \(y\) onto the span of \(\{x_1, ..., x_n\}\): the point in the span that minimizes the Euclidean norm of the residual:

\[\textrm{Proj}(y; \{x_1,...,x_n\}) = \textrm{argmin}_{v\in \textrm{span}(\{x_1,...,x_n\})}||y-v||_2\]

Range/Columnspace: span of columns of matrix

\[R(A) = \{ v \in \mathbb{R}^m: v = Ax, x\in \mathbb{R}^n \}\]

Projection of \(y\in \mathbb{R}^m\) onto the range of \(A\) (assuming \(A\) has full column rank, so \(A^TA\) is invertible) is given by \[\textrm{Proj}(y; A) = \textrm{argmin}_{v\in R(A)}||v-y||_2 = A(A^TA)^{-1}A^Ty\]

Proof: Least Squares and Projection

When \(A\) is a single column vector \(a\), the expression reduces to \(\frac{aa^T}{a^Ta}y\).
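
A small numerical check of the projection formula and its rank-one special case (a sketch; the random `A` is assumed to have full column rank, which holds with probability 1, so \(A^TA\) is invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
y = rng.standard_normal(6)

# Projection of y onto the range of A
proj = A @ np.linalg.inv(A.T @ A) @ A.T @ y

# The residual is orthogonal to every column of A
assert np.allclose(A.T @ (y - proj), 0)

# Rank-one case: for a single column a, the projector reduces to a a^T / (a^T a)
a = A[:, [0]]
assert np.allclose((a @ a.T / (a.T @ a)) @ y,
                   a @ np.linalg.inv(a.T @ a) @ a.T @ y)
```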

Nullspace:

\[N(A) = \{x\in\mathbb{R}^n: Ax=0\}\]

Comment: think of \(x\) as a coordinate vector and the columns of \(A\) as the vectors being combined; certain non-zero \(x\) may combine them into the zero vector.

Orthogonal complements:

Note that \(R(A)\subseteq \mathbb{R}^m\) while \(N(A) \subseteq \mathbb{R}^n\). This is because the range consists of linear combinations of the \(n\) column vectors, each living in \(\mathbb{R}^m\), while the nullspace consists of the coordinate vectors \(x \in \mathbb{R}^n\).

\(R(A^T)\) and \(N(A)\) are both subspaces of \(\mathbb{R}^n\). It can be proved that they intersect only at \(\{0\}\) and together span the entire space \(\mathbb{R}^n\); they are each other's orthogonal complements.

\[\{ w:w=u+v, u\in R(A^T), v\in N(A)\} = \mathbb{R}^n, \quad R(A^T)\cap N(A) = \{0\} \]

Comment: given any \(w \in \mathbb{R}^n\), we seek a decomposition

\[w = u+v = A^Tx + y, \quad Ay = 0, \quad x \in \mathbb{R}^m, \quad y, w \in \mathbb{R}^n\]

Applying \(A\) to both sides gives

\[Aw = AA^Tx\]

Since

\[\textrm{rank}(AA^T) = \textrm{rank}(A)\]

the range of \(AA^T\) equals the range of \(A\), so such an \(x\) always exists; setting \(y = w - A^Tx\) then gives \(Ay = 0\), i.e. \(y \in N(A)\).

Determinant (\(A\in \mathbb{R}^{n\times n}\))

Consider the set of points \(S \subset \mathbb{R}^n\) generated by the rows \(a_i^T\) of \(A\): \[S = \{ v\in \mathbb{R}^n: v = \sum_{i=1}^{n}{\alpha_ia_i}, \quad 0\leq \alpha_i \leq 1 \}\]

\(S\) is the span restricted to coefficients in \([0,1]\) (a parallelotope). \(|A|\) measures the (signed) "volume" of the set \(S\).

Consider \[\begin{bmatrix} a_1^T \\ a_2^T \end{bmatrix} = \begin{bmatrix} 1 & 3 \\ 3 & 2 \end{bmatrix}\]

The area of the parallelogram: \[\sqrt{||a_1||^2||a_2||^2 - (a_1\cdot a_2)^2}\]
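
For this example the two routes agree:

\[|A| = 1\cdot 2 - 3\cdot 3 = -7, \qquad \sqrt{||a_1||^2||a_2||^2 - (a_1\cdot a_2)^2} = \sqrt{10\cdot 13 - 9^2} = \sqrt{49} = 7\]

so the area of the parallelogram equals \(\big||A|\big| = 7\) (the determinant carries a sign; the area is its absolute value).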

  1. The determinant of the identity is 1, |I| = 1. (Geometrically, the volume of a unit hypercube is 1).
  2. Given a matrix \(A \in \mathbb{R}^{n\times n}\), if we multiply a single row in \(A\) by a scalar \(t \in \mathbb{R}\), then the determinant of the new matrix is \(t|A|\). (Geometrically, multiplying one of the sides of the set \(S\) by a factor \(t\) causes the volume to increase by a factor \(t\).)
  3. If we exchange any two rows of \(A\), then the determinant of the new matrix is \(−|A|\)

For \(A,B\in \mathbb{R}^{n\times n}\):

  • \(|A| = |A^T|\)
  • \(|AB| = |A||B|\)
  • \(|A|=0\) iff \(A\) is singular (non-invertible)
  • For non-singular \(A\), \(|A^{-1}| = 1/|A|\)

Recursive formula for the determinant (Laplace expansion). The complete expansion has \(n!\) terms!

\[|A| = \sum_{i=1}^{n}{(-1)^{i+j}a_{ij}|A_{\backslash i,\backslash j}|}, \quad \forall j\]
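
A direct (and exponential-time) implementation of this expansion, compared against `numpy.linalg.det`; a minimal sketch intended only for small matrices, with `det_laplace` a hypothetical helper name:

```python
import numpy as np

def det_laplace(A):
    """Determinant by Laplace expansion along the first column (O(n!) time)."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for i in range(n):
        minor = np.delete(np.delete(A, i, axis=0), 0, axis=1)  # drop row i and column 0
        total += (-1) ** i * A[i, 0] * det_laplace(minor)
    return total

A = np.array([[1.0, 3.0], [3.0, 2.0]])
assert np.isclose(det_laplace(A), np.linalg.det(A))  # both give -7
```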

(Classical) adjoint

\[(adj(A))_{ij} = (-1)^{i+j}|A_{\backslash j,\backslash i}|\]

\[A^{-1} = \frac{1}{|A|}adj(A)\]

Quadratic Forms and PSD matrices

\[x^TAx = \sum_{i=1}^{n}\sum_{j=1}^{n}{A_{ij}x_ix_j}\]

Note that only the symmetric part of \(A\) contributes, so we generally assume \(A\) is symmetric:

\[x^TAx = (x^TAx)^T = x^TA^Tx = x^T(\frac12A + \frac12A^T)x\]

Property: PD/ND matrices are full rank and invertible.

Proof: if not, then \(\exists x \neq 0\) with \(Ax=0\), hence \(x^TAx=0\), contradicting positive/negative definiteness.

Gram matrix: an important family of PSD matrices.

\[G = A^TA\]

\(G\) is always PSD; if \(m\geq n\) and \(A \in \mathbb{R}^{m\times n}\) has full (column) rank, then \(G\) is PD.
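
A numerical illustration (a sketch; a random tall matrix has full column rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 4))   # m = 8 >= n = 4
G = A.T @ A                       # Gram matrix

# PSD: x^T G x = ||Ax||_2^2 >= 0 for any x
x = rng.standard_normal(4)
assert x @ G @ x >= 0

# PD here: all eigenvalues are strictly positive (eigvalsh since G is symmetric)
assert np.all(np.linalg.eigvalsh(G) > 0)
```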

Eigenvalues and Eigenvectors

\[Ax = \lambda x, x\neq 0\]

If \(x\) is an eigenvector for eigenvalue \(\lambda\), then so is \(cx\) for any \(c \neq 0\). So we usually normalize eigenvectors to unit length.

\[(\lambda I - A)x = 0, \quad x\neq 0\] This has a non-zero solution iff \((\lambda I - A)\) has a non-trivial nullspace, which holds iff \[|\lambda I - A| = 0\]

Properties of eigenvalues/eigenvectors

  • The trace equals the sum of the eigenvalues.
  • The determinant equals the product of the eigenvalues.
  • The rank equals the number of non-zero eigenvalues (for diagonalizable \(A\)).
  • If \(A\) is non-singular and \((\lambda, x)\) is an eigenpair, then \(A^{-1}x = (1/\lambda)x\).
  • For a diagonal matrix, the diagonal entries are its eigenvalues.
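
The trace and determinant properties are easy to verify numerically; a minimal sketch on an arbitrary random symmetric matrix (symmetric so that the eigenvalues are real):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                 # symmetrize so eigenvalues are real

w = np.linalg.eigvalsh(A)
assert np.isclose(np.trace(A), np.sum(w))        # trace = sum of eigenvalues
assert np.isclose(np.linalg.det(A), np.prod(w))  # determinant = product of eigenvalues
```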

Eigenvalues and Eigenvectors for symmetric matrices

If \(A\) is symmetric:

  1. All eigenvalues are real.
  2. There exists a set of \(n\) orthonormal (unit-norm, mutually orthogonal) eigenvectors.

Let \[U = [u_1, ..., u_n]\]

\(U\) is orthogonal so \(UU^T=I\).

\[\Lambda = diag(\lambda_1, ..., \lambda_n)\]

\[AU = [Au_1, ... Au_n] = [\lambda_1u_1, ... \lambda_nu_n] = U\Lambda\]

\[A = AUU^T = U\Lambda U^T\]

This is just the diagonalization of the symmetric matrix A.

Comment: when \(A\) is PD, i.e. all eigenvalues are positive, \(A = (U\Lambda^{\frac12})(U\Lambda^{\frac12})^T\), a square-root factorization.
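
This decomposition is what `numpy.linalg.eigh` computes for a symmetric matrix; a small check, including the square-root factorization for a PD example (a sketch; the test matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)           # symmetric and PD

w, U = np.linalg.eigh(A)          # real eigenvalues (ascending) and orthonormal eigenvectors
Lam = np.diag(w)

assert np.allclose(U @ U.T, np.eye(4))   # U is orthogonal
assert np.allclose(A, U @ Lam @ U.T)     # A = U Λ U^T

S = U @ np.diag(np.sqrt(w))              # U Λ^{1/2}, valid since all eigenvalues are positive
assert np.allclose(A, S @ S.T)           # A = (U Λ^{1/2})(U Λ^{1/2})^T
```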

Representing a vector w.r.t. another basis: switching coordinates to an orthogonal basis

\[x = U\hat{x}\]

\[\hat{x} = U^Tx\]

This indicates that \(\hat{x} = U^Tx\) is another representation of \(x\): its coordinates in the basis given by the columns of \(U\).

Diagonalizing matrix-vector multiplication

\[\hat{z} = U^Tz = U^TAx = U^TU\Lambda U^Tx = \Lambda \hat{x}\]

Then matrix multiplication is simpler because it merely scales each coordinate by the corresponding eigenvalue.

This is especially useful when multiplying by \(A\) repeatedly, e.g. applying \(A^3\):

\[\hat{q} = U^Tq = U^TAAAx = U^TU\Lambda U^TU\Lambda U^TU\Lambda U^Tx = \Lambda^3\hat{x} \]

Comment: the time complexity of diagonalization is \(O(n^3)\).
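
A sketch of the \(A^3\) example above: in the eigenbasis the cubed matrix acts as \(\Lambda^3\), i.e. elementwise cubing of the eigenvalues (arbitrary symmetric test matrix):

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                 # symmetric
x = rng.standard_normal(4)

w, U = np.linalg.eigh(A)
x_hat = U.T @ x

q_hat_direct = U.T @ (A @ A @ A @ x)   # U^T A^3 x computed directly
q_hat_diag = (w ** 3) * x_hat          # Λ^3 x̂: just scale each coordinate by λ_i^3
assert np.allclose(q_hat_direct, q_hat_diag)
```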

Diagonalizing quadratic form

\[x^TAx = x^TU\Lambda U^Tx = \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^{n}{\lambda_i \hat{x}_i^2}\]

Then the definiteness of \(A\) is determined by the signs of its eigenvalues.

Consider maximizing the quadratic form subject to \(||\hat{x}||_2^2 = 1\) (equivalently \(||x||_2^2 = 1\), since \(U\) preserves norms).

The upper bound is \(\lambda_1\), the largest eigenvalue, attained by putting all the weight on its coordinate, i.e. \(\hat{x} = e_1\) and hence \(x = u_1\).

Matrix Calculus

The Gradient

The gradient is the matrix of partial derivatives:

\[(\nabla_{A}f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}\]

Keep the notation clear by declaring which variable is being differentiated with respect to (e.g. writing \(z\) for the whole argument and using \(\nabla_z\)).

The Hessian

\[(\nabla_x^2f(x))_{ij} = \frac{\partial^2f(x)}{\partial x_i\partial x_j} \]

The Hessian is symmetric and lies in \(\mathbb{R}^{n\times n}\) (for real-valued \(f\)).

Caveat:

We cannot simply define the Hessian as "the gradient of the gradient", because the gradient of a vector is undefined:

\[\nabla_x\nabla_xf(x) = \nabla_x \begin{bmatrix} \frac{\partial f(x)}{\partial x_1}\\ ...\\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}\]

However, the \(i\)-th column of the Hessian can be defined as the gradient of the \(i\)-th partial derivative:

\[\nabla_x \frac{\partial f(x)}{\partial x_i} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_i \partial x_1}\\ ...\\ \frac{\partial^2 f(x)}{\partial x_i \partial x_n} \end{bmatrix}\]

Therefore, the Hessian can be built column by column (fix one index and differentiate with respect to the other):

\[\nabla_x^2f(x) = [\nabla_x(\nabla_xf(x))_1, ...\nabla_x(\nabla_xf(x))_n]\]

Gradients and Hessians of Quadratic and Linear Forms

Analogous to the scalar derivatives \(\frac{d}{dx}(bx)=b\) and \(\frac{d}{dx}(ax^2)=2ax\); note that the multiplier ends up transposed (the gradient of \(b^Tx\) is \(b\)).

\[\nabla_x b^Tx = b\]

\[\nabla_x x^TAx = 2Ax \quad (A \textrm{ symmetric}), \qquad \nabla_x^2 x^TAx = 2A\]
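
A finite-difference sanity check of these gradients (a minimal sketch; `num_grad` is a hypothetical helper and `eps` an arbitrary small step size):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
b = rng.standard_normal(n)
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                 # symmetric, as assumed for the quadratic form
x = rng.standard_normal(n)
eps = 1e-6

def num_grad(f, x):
    """Central-difference numerical gradient of a scalar-valued f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

assert np.allclose(num_grad(lambda v: b @ v, x), b)                          # grad of b^T x is b
assert np.allclose(num_grad(lambda v: v @ A @ v, x), 2 * A @ x, atol=1e-4)   # grad of x^T A x is 2Ax
```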

Least Squares

Earlier we derived this projection expression geometrically (Least Squares and Projection).

We want to minimize the residual of the projection; this time we use derivatives.

\[||Ax-b||_2^2 = (Ax-b)^T(Ax-b) = x^TA^TAx-2b^TAx+b^Tb\]

\[\nabla_x(||Ax-b||_2^2) = 2A^TAx-2A^Tb\]

Setting the gradient to zero gives

\[x = (A^TA)^{-1}A^Tb\]
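
A quick comparison of the normal-equations solution with `numpy.linalg.lstsq` (a sketch; in practice a QR/SVD-based solver such as `lstsq` is preferred for numerical stability over forming \(A^TA\)):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((10, 3))  # overdetermined system
b = rng.standard_normal(10)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # x = (A^T A)^{-1} A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # numpy's least-squares solver

assert np.allclose(x_normal, x_lstsq)
```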

Gradients of the Determinant

For symmetric \(A\),

\[\nabla_A log|A| = \frac{1}{|A|}\nabla_A|A| = A^{-1}\]

The proof uses the (classical) adjoint and cofactors: \(\frac{\partial |A|}{\partial A_{ij}} = (\textrm{adj}(A))_{ji}\).
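
A finite-difference check of \(\nabla_A \log|A| = A^{-1}\) on an arbitrary symmetric PD matrix (a minimal sketch; each entry is perturbed independently, which for symmetric \(A\) still recovers \(A^{-1}\)):

```python
import numpy as np

rng = np.random.default_rng(9)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)           # symmetric PD, so log|A| is well defined
eps = 1e-6

grad = np.zeros_like(A)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad[i, j] = (np.log(np.linalg.det(A + E)) -
                      np.log(np.linalg.det(A - E))) / (2 * eps)

assert np.allclose(grad, np.linalg.inv(A), atol=1e-4)   # matches A^{-1}
```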

Eigenvalues as optimization

Earlier, in the eigenvalues and eigenvectors section, we discussed maximizing a quadratic form with a norm constraint.

\[\max_{x\in\mathbb{R}^n} x^TAx, \quad \textrm{s.t.}\; ||x||_2^2=1\]

From the perspective of Lagrange multipliers:

\[L(x, \lambda) = x^TAx - \lambda(x^Tx-1)\]

\[\nabla_x L(x,\lambda) = \nabla_x(x^TAx - \lambda x^Tx) = 2Ax - 2\lambda x = 0\]

\[Ax = \lambda x\]

The method of Lagrange multipliers guarantees that every extremum of the original constrained problem appears among the stationary points of the Lagrangian, but it does not guarantee that every such stationary point is an extremum of the original problem.

This indicates that the optimal \(x\) for this problem must be an eigenvector of \(A\); as argued earlier, the maximizer is the eigenvector \(u_1\) associated with the largest eigenvalue \(\lambda_1\).
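
A numerical illustration that the constrained maximum is attained at the top eigenvector (a sketch; random unit vectors are used only for comparison):

```python
import numpy as np

rng = np.random.default_rng(10)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2                     # symmetric

w, U = np.linalg.eigh(A)              # eigenvalues in ascending order
u_top, lam_max = U[:, -1], w[-1]

assert np.isclose(u_top @ A @ u_top, lam_max)   # the top eigenvector attains the max eigenvalue

for _ in range(1000):                 # no random unit vector does better
    v = rng.standard_normal(6)
    v /= np.linalg.norm(v)
    assert v @ A @ v <= lam_max + 1e-9
```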