HW2
Hua Yao, UNI:hy2632
Problem 1: Convolutional Neural Networks [20 points]
The input RGB image has 3 channels. Apply the Conv2D filter to each channel to get 3 feature maps.
With no padding, the shape of each of the 3 feature maps is \(125\times125\):
\[O = \frac{I-k}{s} + 1 = \frac{128-4}{1} +1 = 125\]
The number of parameters in the convolution layer is \(4^2 \times 3 = 48\).
The shape of each of the 3 feature maps after max-pooling is \(61\times61\): \[O = \frac{I-k}{s} + 1 = \frac{125-5}{2} + 1 = 61\]
The number of parameters in max-pooling is \(0\).
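As a quick sanity check of the output-size formula used twice above, here is a minimal Python helper (the name `out_size` is ours, purely for illustration):

```python
def out_size(i: int, k: int, s: int) -> int:
    """Output spatial size for a window of size k and stride s, no padding."""
    return (i - k) // s + 1

print(out_size(128, 4, 1))  # 125: Conv2D, 4x4 kernel, stride 1
print(out_size(125, 5, 2))  # 61:  max-pooling, 5x5 window, stride 2
```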
After max-pooling, the fully connected layer maps \(I: 61\times61\times3 \to O: 256\), which takes \(61\times61\times3\times256=2857728\) parameters.
The output layer maps \(I: 256 \to O: 10\), followed by a softmax. This step takes \(256\times10=2560\) parameters.
- Answer:
- Total number of parameters in the NN is \(48+0+2857728+2560=2860336\).
- The feature maps of the Conv layer have shape \(125\times125\); those of the max-pooling layer have shape \(61\times61\).
- The max-pooling layer consists of 3 feature maps (the same number of channels as the Conv layer, because max-pooling is applied channel-wise); a verification sketch follows.
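As a cross-check, here is a minimal PyTorch sketch of the architecture, assuming a depthwise convolution (one \(4\times4\) filter per channel) and no bias terms anywhere, which matches the counts above:

```python
import torch
import torch.nn as nn

# Sketch only: depthwise conv (groups=3) and bias=False are assumptions
# made to match the parameter counts derived above.
model = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=4, stride=1, groups=3, bias=False),  # 4*4*3 = 48
    nn.MaxPool2d(kernel_size=5, stride=2),                           # 0 params
    nn.Flatten(),
    nn.Linear(61 * 61 * 3, 256, bias=False),                         # 2,857,728
    nn.Linear(256, 10, bias=False),                                  # 2,560
    nn.Softmax(dim=1),
)

x = torch.randn(1, 3, 128, 128)
print(model(x).shape)                              # torch.Size([1, 10])
print(sum(p.numel() for p in model.parameters()))  # 2860336
```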
Problem 2: Deep Residual Networks [40 points]
2.1 Form of partial derivative
\[a_{i+1} = a_i + \frac1T\cdot|Wa_i|\]
Assume that \(W \in \mathbb{R}^{d\times d}\) is orthogonal (so \(W^TW = I\), as used below) and \(a_i \in \mathbb{R}^d\).
\[\begin{aligned}\because|Wa_i| &= \sqrt{(Wa_i)^TWa_i}\\ &=\sqrt{a_i^TW^TWa_i}\\ &=\sqrt{a_i^Ta_i}\\ &= |a_i| \end{aligned}\]
\[\begin{aligned}\frac{\partial|a_i|}{\partial a_i} &= \frac{\partial}{\partial a_i} \sqrt{a_i^Ta_i}\\ &=\frac{2a_i}{2\sqrt{a_i^Ta_i}}\\ &= \frac{a_i}{|a_i|} \end{aligned}\]
\[\begin{aligned}\therefore \frac{\partial a_{i+1}}{\partial a_i} &= I + \frac1T\frac{a_i}{|a_i|} \quad \text{(broadcast to each column to form a } d\times d \text{ matrix)} \end{aligned}\]
From chain rule,
\[\frac{\partial L}{\partial a_t} = \frac{\partial L}{\partial a_T}\cdot \frac{\partial a_T}{\partial a_{T-1}}\cdots\frac{\partial a_{t+1}}{\partial a_t}\]
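As a sanity check (not part of the required derivation), here is a small NumPy sketch verifying the two identities used above, namely \(|Wa| = |a|\) for orthogonal \(W\) and \(\partial|a|/\partial a = a/|a|\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
W, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
a = rng.standard_normal(d)

# Orthogonal W preserves the 2-norm: |W a| = |a|
print(np.allclose(np.linalg.norm(W @ a), np.linalg.norm(a)))  # True

# Central finite differences confirm d|a|/da = a / |a|
eps = 1e-6
grad_fd = np.array([
    (np.linalg.norm(a + eps * e) - np.linalg.norm(a - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(grad_fd, a / np.linalg.norm(a)))  # True
```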
2.2 Upper bound
Reference: https://www.cs.utexas.edu/users/flame/laff/alaff-beta/chapter01-norms-matrix-submultiplicative.html
The 2-norm of a matrix is defined by \[||A||_2 = \sqrt{\lambda_1}\] where \(\lambda_1\) is the largest eigenvalue of \(A^TA\).
The triangle inequality for matrix norms gives
\[||x+y|| \leq ||x|| + ||y||\]
\[\begin{aligned} \therefore \bigg|\bigg| \frac{\partial a_{i+1}}{\partial a_i}\bigg|\bigg| &= \bigg|\bigg| I + \frac1T\frac{a_i}{|a_i|} \bigg|\bigg| \\ &\leq 1 + \frac1T \bigg|\bigg| \frac{a_i}{|a_i|} \bigg|\bigg| \\ &= 1 + \frac1T \end{aligned}\]
Note that the second term \(\frac1T\frac{a_i}{|a_i|}\) must be broadcast to \(d\) columns to become a \(d\times d\) matrix; the norm of this matrix is taken to be the same as the norm of the vector before broadcasting.
The submultiplicative property of the 2-norm gives \[||AB|| \leq ||A||\: ||B||\]
The matrix 2-norm is subordinate to the vector 2-norm: \[\| A x \| \leq \| A \| \| x \|\]
\[\begin{aligned} \therefore \bigg|\bigg|\frac{\partial L}{\partial a_t}\bigg|\bigg| &\leq \bigg|\bigg|\frac{\partial L}{\partial a_T} \bigg|\bigg| \: \bigg|\bigg|\frac{\partial a_T}{\partial a_{T-1}}\bigg|\bigg|\cdots\bigg|\bigg|\frac{\partial a_{t+1}}{\partial a_t}\bigg|\bigg|\\ &\leq \bigg|\bigg|\frac{\partial L}{\partial a_T} \bigg|\bigg| \times (1+\frac1T)^{T-t}\\ &= \bigg|\bigg|\frac{\partial L}{\partial a_T}\bigg|\bigg| \times (1+\frac1T)^{T} \times (1+\frac1T)^{-t}\\ &\approx\bigg|\bigg|\frac{\partial L}{\partial a_T}\bigg|\bigg| \times e \times (1+\frac1T)^{-t} \quad \text{for large } T \end{aligned}\]
This gives an upper bound on the ratio \[r = \frac{\|\frac{\partial L}{\partial a_t}\|}{\|\frac{\partial L}{\partial a_T}\|} \leq e\times (1+\frac1T)^{-t}\]
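As a quick numerical illustration of how tight the large-\(T\) approximation is, compare the exact factor \((1+\frac1T)^{T-t}\) with \(e\,(1+\frac1T)^{-t}\):

```python
import math

# Compare the exact upper-bound factor with its large-T approximation.
T = 1000
for t in (0, 250, 500, 750, 999):
    exact = (1 + 1 / T) ** (T - t)
    approx = math.e * (1 + 1 / T) ** (-t)
    print(f"t={t}: exact={exact:.4f}, approx={approx:.4f}")
```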
2.3 Lower bound
Reference: https://ccse.lbl.gov/Publications/sepp/matrixLowerBound/LBNL-50635.pdf
Since \(\|A\| = \|ABB^{-1}\| \leq \|AB\|\,\|B^{-1}\|\), the submultiplicative property can be rewritten as the lower bound
\[\|A\|\|B^{-1}\|^{-1} \leq \|AB\|\]
Denote \(B_i = I + \frac1T \frac{a_i}{\|a_i\|}\).
To find the inverse of \(B_i\), note that \[\begin{aligned} \because(I + \frac1T \frac{a_i}{\|a_i\|})(I - \frac1T \frac{a_i}{\|a_i\|}) &= I - \frac1{T^2}I \\ &= \frac{T^2-1}{T^2}I \end{aligned}\]
\[\therefore B_i^{-1} = \frac{T^2}{T^2-1}(I - \frac1T \frac{a_i}{\|a_i\|})\]
Using the triangle inequality of the matrix norm, \(\|x-y\| \leq \|x\| + \|-y\| = \|x\| + \|y\|\):
\[\therefore \|B_i^{-1}\| \leq \frac{T^2}{T^2-1}(1+\frac1T)=\frac T{T-1}\]
\[\therefore \|B_i^{-1}\|^{-1} \geq 1 - \frac{1}{T}\]
\[\begin{aligned} \therefore r &= \frac{\|\frac{\partial L}{\partial a_t}\|}{\|\frac{\partial L}{\partial a_T}\|} \\ &\geq (\|B_{T-1}^{-1}\|...\|B_t^{-1}\|)^{-1}\\ &\geq (1 - \frac{1}{T})^{T-t}\\ &\approx \frac1e \times (1-\frac{1}{T})^{-t} \end{aligned}\]
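The analogous numerical check for the lower-bound factor, comparing \((1-\frac1T)^{T-t}\) with \(\frac1e(1-\frac1T)^{-t}\):

```python
import math

# Compare the exact lower-bound factor with its large-T approximation.
T = 1000
for t in (0, 250, 500, 750, 999):
    exact = (1 - 1 / T) ** (T - t)
    approx = (1 / math.e) * (1 - 1 / T) ** (-t)
    print(f"t={t}: exact={exact:.4f}, approx={approx:.4f}")
```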
2.4 Conclusion
From the above we get
\[(1-\frac1T)^{T-t} \leq r \leq (1+\frac1T)^{T-t}\]
When \(T\) is large,
\[(1+\frac1T)^{T}\approx e \] \[(1-\frac1T)^{T}\approx \frac1e \]
\[\frac1e \times (1-\frac{1}{T})^{-t} \leq r \leq e\times (1+\frac1T)^{-t}\]