HW2
Hua Yao, UNI:hy2632
Problem 1: Convolutional Neural Networks [20 points]
The input RGB image has 3 channels. Apply the Conv2D filter to each channel to get 3 feature maps.
With no padding, the shape of each of the 3 feature maps is \(125\times125\):
\[O = \frac{I-k}{s} + 1 = \frac{128-4}{1} +1 = 125\]
The number of parameters in the convolution layer is \(4^2 \times 3 = 48\).
The shape of each of the 3 feature maps after max-pooling is \(61\times61\): \[O = \frac{I-k}{s} + 1 = \frac{125-5}{2} + 1 = 61\]
The number of parameters in max-pooling is \(0\).
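As a quick sanity check of the output-size formula used twice above, here is a minimal Python helper (the name `out_size` is ours, purely for illustration):

```python
def out_size(i: int, k: int, s: int) -> int:
    """Output spatial size for a window of size k and stride s, no padding."""
    return (i - k) // s + 1

print(out_size(128, 4, 1))  # 125: Conv2D, 4x4 kernel, stride 1
print(out_size(125, 5, 2))  # 61:  max-pooling, 5x5 window, stride 2
```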
After max-pooling, the fully connected layer maps \(I: 61\times61\times3 \to O: 256\), which takes \(61\times61\times3\times256=2857728\) parameters.
The output layer maps \(I: 256 \to O: 10\), followed by a softmax. This step takes \(256\times10=2560\) parameters.
- Answer:
- Total number of parameters in the NN is \(48+0+2857728+2560=2860336\).
- The feature maps of the Conv layer have shape \(125\times125\); those of the max-pooling layer have shape \(61\times61\).
- The max-pooling layer consists of 3 feature maps (the same number of channels as the Conv layer, because max-pooling is applied channel-wise); a verification sketch follows.
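As a cross-check, here is a minimal PyTorch sketch of the architecture, assuming a depthwise convolution (one \(4\times4\) filter per channel) and no bias terms anywhere, which matches the counts above:

```python
import torch
import torch.nn as nn

# Sketch only: depthwise conv (groups=3) and bias=False are assumptions
# made to match the parameter counts derived above.
model = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=4, stride=1, groups=3, bias=False),  # 4*4*3 = 48
    nn.MaxPool2d(kernel_size=5, stride=2),                           # 0 params
    nn.Flatten(),
    nn.Linear(61 * 61 * 3, 256, bias=False),                         # 2,857,728
    nn.Linear(256, 10, bias=False),                                  # 2,560
    nn.Softmax(dim=1),
)

x = torch.randn(1, 3, 128, 128)
print(model(x).shape)                              # torch.Size([1, 10])
print(sum(p.numel() for p in model.parameters()))  # 2860336
```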
Problem 2: Deep Residual Networks [40 points]
2.1 Form of partial derivative
\[a_{i+1} = a_i + \frac1T\cdot|Wa_i|\]
Assume that \(W \in \mathbb{R}^{d\times d}\) is orthogonal (so \(W^TW = I\), as used below) and \(a_i \in \mathbb{R}^d\).
\[\begin{aligned}\because|Wa_i| &= \sqrt{(Wa_i)^TWa_i}\\ &=\sqrt{a_i^TW^TWa_i}\\ &=\sqrt{a_i^Ta_i}\\ &= |a_i| \end{aligned}\]
\[\begin{aligned}\frac{\partial|a_i|}{\partial a_i} &= \frac{\partial}{\partial a_i} \sqrt{a_i^Ta_i}\\ &=\frac{2a_i}{2\sqrt{a_i^Ta_i}}\\ &= \frac{a_i}{|a_i|} \end{aligned}\]
\[\begin{aligned}\therefore \frac{\partial a_{i+1}}{\partial a_i} &= I + \frac1T\frac{a_i}{|a_i|} \quad \text{(broadcast to each column to form a } d\times d \text{ matrix)} \end{aligned}\]
From chain rule,
\[\frac{\partial L}{\partial a_t} = \frac{\partial L}{\partial a_T}\cdot \frac{\partial a_T}{\partial a_{T-1}}\cdots\frac{\partial a_{t+1}}{\partial a_t}\]
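As a sanity check (not part of the required derivation), here is a small NumPy sketch verifying the two identities used above, namely \(|Wa| = |a|\) for orthogonal \(W\) and \(\partial|a|/\partial a = a/|a|\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
W, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
a = rng.standard_normal(d)

# Orthogonal W preserves the 2-norm: |W a| = |a|
print(np.allclose(np.linalg.norm(W @ a), np.linalg.norm(a)))  # True

# Central finite differences confirm d|a|/da = a / |a|
eps = 1e-6
grad_fd = np.array([
    (np.linalg.norm(a + eps * e) - np.linalg.norm(a - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(grad_fd, a / np.linalg.norm(a)))  # True
```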
2.2 Upper bound
Reference: https://www.cs.utexas.edu/users/flame/laff/alaff-beta/chapter01-norms-matrix-submultiplicative.html
The 2-norm of a matrix is defined by \[||A||_2 = \sqrt{\lambda_1}\] where \(\lambda_1\) is the largest eigenvalue of \(A^TA\).
The triangle inequality for matrix norms gives
\[||x+y|| \leq ||x|| + ||y||\]
\[\begin{aligned} \therefore \bigg|\bigg| \frac{\partial a_{i+1}}{\partial a_i}\bigg|\bigg| &= \bigg|\bigg| I + \frac1T\frac{a_i}{|a_i|} \bigg|\bigg| \\ &\leq 1 + \frac1T \bigg|\bigg| \frac{a_i}{|a_i|} \bigg|\bigg| \\ &= 1 + \frac1T \end{aligned}\]
Note that the second term \(\frac1T\frac{a_i}{|a_i|}\) must be broadcast to \(d\) columns to become a \(d\times d\) matrix; the norm of this matrix is taken to be the same as the norm of the vector before broadcasting.
The submultiplicative property of the 2-norm gives \[||AB|| \leq ||A||\: ||B||\]
The matrix 2-norm is subordinate to the vector 2-norm: \[\| A x \| \leq \| A \| \| x \|\]
\[\begin{aligned} \therefore \bigg|\bigg|\frac{\partial L}{\partial a_t}\bigg|\bigg| &\leq \bigg|\bigg|\frac{\partial L}{\partial a_T} \bigg|\bigg| \: \bigg|\bigg|\frac{\partial a_T}{\partial a_{T-1}}\bigg|\bigg|\cdots\bigg|\bigg|\frac{\partial a_{t+1}}{\partial a_t}\bigg|\bigg|\\ &\leq \bigg|\bigg|\frac{\partial L}{\partial a_T} \bigg|\bigg| \times (1+\frac1T)^{T-t}\\ &= \bigg|\bigg|\frac{\partial L}{\partial a_T}\bigg|\bigg| \times (1+\frac1T)^{T} \times (1+\frac1T)^{-t}\\ &\approx\bigg|\bigg|\frac{\partial L}{\partial a_T}\bigg|\bigg| \times e \times (1+\frac1T)^{-t} \quad \text{for large } T \end{aligned}\]
This gives an upper bound on the ratio \[r = \frac{\|\frac{\partial L}{\partial a_t}\|}{\|\frac{\partial L}{\partial a_T}\|} \leq e\times (1+\frac1T)^{-t}\]
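As a quick numerical illustration of how tight the large-\(T\) approximation is, compare the exact factor \((1+\frac1T)^{T-t}\) with \(e\,(1+\frac1T)^{-t}\):

```python
import math

# Compare the exact upper-bound factor with its large-T approximation.
T = 1000
for t in (0, 250, 500, 750, 999):
    exact = (1 + 1 / T) ** (T - t)
    approx = math.e * (1 + 1 / T) ** (-t)
    print(f"t={t}: exact={exact:.4f}, approx={approx:.4f}")
```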
2.3 Lower bound
Reference: https://ccse.lbl.gov/Publications/sepp/matrixLowerBound/LBNL-50635.pdf
Since \(\|A\| = \|ABB^{-1}\| \leq \|AB\|\,\|B^{-1}\|\), the submultiplicative property can be rewritten as the lower bound
\[\|A\|\|B^{-1}\|^{-1} \leq \|AB\|\]
Denote \(B_i = I + \frac1T \frac{a_i}{\|a_i\|}\).
To find the inverse of \(B_i\), note that \[\begin{aligned} \because(I + \frac1T \frac{a_i}{\|a_i\|})(I - \frac1T \frac{a_i}{\|a_i\|}) &= I - \frac1{T^2}I \\ &= \frac{T^2-1}{T^2}I \end{aligned}\]
\[\therefore B_i^{-1} = \frac{T^2}{T^2-1}(I - \frac1T \frac{a_i}{\|a_i\|})\]
Using the triangle inequality of the matrix norm, \(\|x-y\| \leq \|x\| + \|-y\| = \|x\| + \|y\|\):
\[\therefore \|B_i^{-1}\| \leq \frac{T^2}{T^2-1}(1+\frac1T)=\frac T{T-1}\]
\[\therefore \|B_i^{-1}\|^{-1} \geq 1 - \frac{1}{T}\]
\[\begin{aligned} \therefore r &= \frac{\|\frac{\partial L}{\partial a_t}\|}{\|\frac{\partial L}{\partial a_T}\|} \\ &\geq (\|B_{T-1}^{-1}\|...\|B_t^{-1}\|)^{-1}\\ &\geq (1 - \frac{1}{T})^{T-t}\\ &\approx \frac1e \times (1-\frac{1}{T})^{-t} \end{aligned}\]
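The analogous numerical check for the lower-bound factor, comparing \((1-\frac1T)^{T-t}\) with \(\frac1e(1-\frac1T)^{-t}\):

```python
import math

# Compare the exact lower-bound factor with its large-T approximation.
T = 1000
for t in (0, 250, 500, 750, 999):
    exact = (1 - 1 / T) ** (T - t)
    approx = (1 / math.e) * (1 - 1 / T) ** (-t)
    print(f"t={t}: exact={exact:.4f}, approx={approx:.4f}")
```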
2.4 Conclusion
From the above we get
\[(1-\frac1T)^{T-t} \leq r \leq (1+\frac1T)^{T-t}\]
When \(T\) is large,
\[(1+\frac1T)^{T}\approx e \] \[(1-\frac1T)^{T}\approx \frac1e \]
\[\frac1e \times (1-\frac{1}{T})^{-t} \leq r \leq e\times (1+\frac1T)^{-t}\]