# HW2

Hua Yao, UNI: hy2632

## Problem 1: Convolutional Neural Networks [20 points]

The input RGB image has 3 channels. Applying the Conv2D filter to each channel gives 3 feature maps.

With no padding, the shape of the 3 feature maps is $$125\times125$$:

$O = \frac{I-k}{s} + 1 = \frac{128-4}{1} +1 = 125$

The number of parameters in the convolution layer is $$4^2 \times 3 = 48$$ (one $$4\times4$$ filter per channel, no bias term).

The shape of the 3 feature maps after max-pooling is $$61\times61$$:

$O = \frac{I-k}{s} + 1 = \frac{125-5}{2} + 1 = 61$

The number of parameters in max-pooling is $$0$$.

After max-pooling, the fully connected layer maps $$I: 61\times61\times3 \to O:256$$, which takes $$61\times61\times3\times256=2857728$$ parameters.

The output layer maps $$I:256 \to O:10$$ and applies softmax. This step takes $$256\times10=2560$$ parameters.

- The total number of parameters in the NN is $$48+0+2857728+2560=2860336$$.
- The feature maps of the Conv layer are $$125\times125$$; those of the max-pooling layer are $$61\times61$$.
- The max-pooling layer has 3 feature maps (the same number of channels as the Conv layer, because max-pooling is applied channel-wise).
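The shape and parameter arithmetic above can be checked with a short script (a sketch assuming the setup used here: a $$128\times128$$ RGB input, one $$4\times4$$ filter per channel with stride 1, $$5\times5$$ max-pooling with stride 2, and no bias terms):

```python
def out_size(i, k, s):
    """Output side length for a window of size k and stride s, no padding."""
    return (i - k) // s + 1

conv = out_size(128, 4, 1)         # Conv2D: 4x4 kernel, stride 1 -> 125
pool = out_size(conv, 5, 2)        # max-pooling: 5x5 window, stride 2 -> 61

conv_params = 4 * 4 * 3            # one 4x4 filter per channel, no bias
pool_params = 0                    # max-pooling has no learnable parameters
fc_params = pool * pool * 3 * 256  # flatten the 3 feature maps into 256 units
out_params = 256 * 10              # output layer before softmax

total = conv_params + pool_params + fc_params + out_params
print(conv, pool, total)           # 125 61 2860336
```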

## Problem 2: Deep Residual Networks [40 points]

### 2.1 Form of partial derivative

$a_{i+1} = a_i + \frac1T\cdot|Wa_i|$

Assume that $$W \in \mathbb{R}^{d\times d}$$ is orthogonal, so $$W^TW = I$$, and $$a_i \in \mathbb{R}^d$$.

\begin{aligned}\because|Wa_i| &= \sqrt{(Wa_i)^TWa_i}\\ &=\sqrt{a_i^TW^TWa_i}\\ &=\sqrt{a_i^Ta_i}\\ &= |a_i| \end{aligned}

\begin{aligned}\frac{\partial|a_i|}{\partial a_i} &= \frac{\partial}{\partial a_i} \sqrt{a_i^Ta_i}\\ &=\frac{2a_i}{2\sqrt{a_i^Ta_i}}\\ &= \frac{a_i}{|a_i|} \end{aligned}

\begin{aligned}\therefore \frac{\partial a_{i+1}}{\partial a_i} &= I + \frac1T\frac{a_i}{|a_i|} \text{\quad (broadcasting to each column)} \end{aligned}

From the chain rule,

$\frac{\partial L}{\partial a_t} = \frac{\partial L}{\partial a_T}\cdot \frac{\partial a_T}{\partial a_{T-1}}\cdots\frac{\partial a_{t+1}}{\partial a_t}$

### 2.2 Upper bound

Reference: https://www.cs.utexas.edu/users/flame/laff/alaff-beta/chapter01-norms-matrix-submultiplicative.html

The 2-norm of a matrix is defined as $||A||_2 = \sqrt{\lambda_1}$, where $$\lambda_1$$ is the largest eigenvalue of $$A^TA$$.

The triangle inequality of the matrix norm gives that

$||x+y|| \leq ||x|| + ||y||$

\begin{aligned} \therefore \bigg|\bigg| \frac{\partial a_{i+1}}{\partial a_i}\bigg|\bigg| &= \bigg|\bigg| I + \frac1T\frac{a_i}{|a_i|} \bigg|\bigg| \\ &\leq 1 + \frac1T \bigg|\bigg| \frac{a_i}{|a_i|} \bigg|\bigg| \\ &= 1 + \frac1T \end{aligned}

Note that the second term $$\frac1T\frac{a_i}{|a_i|}$$ should actually be broadcast to d columns to become a $$d\times d$$ matrix. But the norm of this matrix is the same as the norm of the vector before broadcasting.

Submultiplicative property of 2-norm gives: $||AB|| \leq ||A||\: ||B||$

Matrix 2-norm is subordinate to vector 2-norm: $\| A x \| \leq \| A \| \| x \|$

\begin{aligned} \therefore \bigg|\bigg|\frac{\partial L}{\partial a_t}\bigg|\bigg| &\leq \bigg|\bigg|\frac{\partial L}{\partial a_T} \bigg|\bigg| \: \bigg|\bigg|\frac{\partial a_T}{\partial a_{T-1}}\bigg|\bigg|\cdots\bigg|\bigg|\frac{\partial a_{t+1}}{\partial a_t}\bigg|\bigg|\\ &\leq \bigg|\bigg|\frac{\partial L}{\partial a_T} \bigg|\bigg| \times (1+\frac1T)^{T-t}\\ &= \bigg|\bigg|\frac{\partial L}{\partial a_T}\bigg|\bigg| \times (1+\frac1T)^{T} \times (1+\frac1T)^{-t}\\ &\approx\bigg|\bigg|\frac{\partial L}{\partial a_T}\bigg|\bigg| \times e \times (1+\frac1T)^{-t} \quad \text{(for $T$ large enough)} \end{aligned}

This gives an upper bound on the ratio $r = \frac{\|\frac{\partial L}{\partial a_t}\|}{\|\frac{\partial L}{\partial a_T}\|} \leq e\times (1+\frac1T)^{-t}$
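The algebra behind this bound, $(1+\frac1T)^{T-t} = (1+\frac1T)^T(1+\frac1T)^{-t}$ together with $(1+\frac1T)^T < e$, can be sanity-checked numerically (a sketch; the $T$ and $t$ values below are arbitrary examples):

```python
import math

# Check (1 + 1/T)**(T - t) <= e * (1 + 1/T)**(-t) for sample values.
# This holds because (1 + 1/T)**T increases toward e as T grows.
for T in (10, 100, 1000):
    for t in (0, T // 2, T - 1):
        lhs = (1 + 1/T) ** (T - t)
        rhs = math.e * (1 + 1/T) ** (-t)
        assert lhs <= rhs
print("upper bound holds")
```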

### 2.3 Lower bound

Reference: https://ccse.lbl.gov/Publications/sepp/matrixLowerBound/LBNL-50635.pdf

The submultiplicative property also yields a lower bound: since $$\|A\| = \|ABB^{-1}\| \leq \|AB\|\,\|B^{-1}\|$$ for invertible $$B$$, it can be rewritten as

$\|A\|\|B^{-1}\|^{-1} \leq \|AB\|$

Denote $$B_i = I + \frac1T \frac{a_i}{\|a_i\|}$$.

Solve for the inverse of $$B_i$$:

\begin{aligned} \because(I + \frac1T \frac{a_i}{\|a_i\|})(I - \frac1T \frac{a_i}{\|a_i\|}) &= I - \frac1{T^2}I \\ &= \frac{T^2-1}{T^2}I \end{aligned}

$\therefore B_i^{-1} = \frac{T^2}{T^2-1}(I - \frac1T \frac{a_i}{\|a_i\|})$

Use the triangle inequality of the matrix norm, $$\|x-y\| \leq \|x\| + \|-y\| = \|x\| + \|y\|$$:

$\therefore \|B_i^{-1}\| \leq \frac{T^2}{T^2-1}(1+\frac1T)=\frac T{T-1}$

$\therefore \|B_i^{-1}\|^{-1} \geq 1 - \frac{1}{T}$

\begin{aligned} \therefore r &= \frac{\|\frac{\partial L}{\partial a_t}\|}{\|\frac{\partial L}{\partial a_T}\|} \\ &\geq (\|B_{T-1}^{-1}\|...\|B_t^{-1}\|)^{-1}\\ &\geq (1 - \frac{1}{T})^{T-t}\\ &\approx \frac1e \times (1-\frac{1}{T})^{-t} \end{aligned}

### 2.4 Conclusion

From above we get

$(1-\frac{1}{T})^{T-t} \leq r \leq (1+\frac1T)^{T-t}$

When $$T$$ is large,

$(1+\frac1T)^{T}\approx e, \qquad (1-\frac1T)^{T}\approx \frac1e$

$\frac1e \times (1-\frac{1}{T})^{-t} \leq r \leq e\times (1+\frac1T)^{-t}$
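The two limits used in this conclusion can be checked numerically (a quick sketch; $$T = 10000$$ is an arbitrary choice):

```python
import math

# As T grows, (1 + 1/T)**T -> e and (1 - 1/T)**T -> 1/e,
# so the sandwich bounds on r approach e and 1/e.
T = 10_000
upper = (1 + 1/T) ** T
lower = (1 - 1/T) ** T
assert abs(upper - math.e) < 1e-3      # close to e
assert abs(lower - 1/math.e) < 1e-3    # close to 1/e
print(upper, lower)
```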