Features of CNN
Local receptive fields
Feature pooling
Weight sharing
1. Convolution Operation
1D Continuous
(f * g)(t) = ∫ f(τ) g(t − τ) dτ = (g * f)(t)
- (f * g)(t): the convolution of the function f by the kernel g
- τ: the point of the input function being weighted (a part of the convolution)
- t: the point at which the convolution is evaluated (the iteration number in the discrete case)
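A minimal sketch of the discrete analogue in NumPy (np.convolve flips the kernel and slides it over the input; the values are illustrative):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input signal
g = np.array([0.25, 0.5, 0.25])          # smoothing kernel

# 'valid' keeps only the positions where the kernel fully overlaps the input
out = np.convolve(f, g, mode='valid')
print(out)  # [2. 3. 4.]
```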
Hyper Parameters
Kernel size : the size of the kernel function, i.e. how many inputs are used to compute each output
Stride : how many inputs to skip between successive convolution computations
- Must be smaller than the kernel size (otherwise some inputs are never used)
- A larger stride makes the output more sparse (see the sketch after this list)
Input Padding : pad zeros to the beginning & end of the input
- Reduces the edge effect: without padding, the left-most & right-most inputs contribute to fewer outputs, which generates a different distribution at the borders
- For image processing: add a border of zeros around the image
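A quick sketch of how these hyperparameters interact; the output-length formula below is the standard one, and the function name is illustrative:

```python
def conv_output_length(n, kernel_size, stride=1, padding=0):
    """Number of outputs of a 1D convolution over n inputs."""
    return (n + 2 * padding - kernel_size) // stride + 1

print(conv_output_length(10, kernel_size=3))             # 8
print(conv_output_length(10, kernel_size=3, stride=2))   # 4  (larger stride -> sparser output)
print(conv_output_length(10, kernel_size=3, padding=1))  # 10 (padding preserves the input length)
```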
2D Discrete
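For reference, the standard definition of the 2D discrete convolution of an image I with a kernel K (added as a sketch of what this section covers; the sums run over the finite support of K):

(I * K)(i, j) = Σ_m Σ_n I(i − m, j − n) · K(m, n)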
Image Processing Kernels
Edge Detection
Edge = gradient of the values of the neighboring pixels
- Similar neighboring pixels → small gradient → not edge
- Different neighboring pixels → large gradient → edge
Blurring
- Large kernel size ⇒ more blurring
Horizontal line detection
Vertical line detection
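A minimal sketch applying these kernels with scipy.signal.convolve2d (the kernel values below are common textbook choices, assumed here rather than taken from the lecture):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(8, 8)

edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])   # large response where neighboring pixels differ
blur = np.ones((3, 3)) / 9.0             # average of the 3x3 neighborhood
horizontal = np.array([[-1, -1, -1],
                       [ 2,  2,  2],
                       [-1, -1, -1]])    # responds to horizontal lines
vertical = horizontal.T                  # responds to vertical lines

for name, k in [("edge", edge_detect), ("blur", blur),
                ("horizontal", horizontal), ("vertical", vertical)]:
    out = convolve2d(image, k, mode="same")  # same-size output via zero padding
    print(name, out.shape)
```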
Example
Alexnet
Interpretation: most of the learned kernels detect edges at different orientations
Randomly initialize the kernels
Train the network to optimize the kernels
TODO Backpropagation in Convolution Layer
1D Discrete Example
- Weights represent the kernel function
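A minimal sketch of the idea with torch autograd (assumed here): because the kernel weights are shared across positions, the gradient of each weight accumulates a contribution from every output position it touched.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[1.0, 2.0, 3.0, 4.0, 5.0]]])            # (batch, channels, length)
w = torch.tensor([[[0.1, 0.2, 0.3]]], requires_grad=True)  # the shared kernel weights

y = F.conv1d(x, w)   # note: conv1d implements cross-correlation (no kernel flip)
loss = y.sum()
loss.backward()

# dL/dw[k] = sum over output positions of the input value that weight k touched
print(w.grad)        # tensor([[[ 6.,  9., 12.]]])
```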
2. Pooling/Subsampling Layer
- Applied after the Convolutional layer
- A simpler convolution operation
- Replaces the output at a certain location with a summary statistic of the nearby inputs
⇒ Makes the network invariant to small translations
- Allows us to convert a variable-sized input into a fixed-size output ⇒ invariance
Max Pooling
Pick the largest in each 2x2 block
- 2x2 filter at stride 2 ⇒ decrease resolution from 4x4 to 2x2
- Hyperparameter: Filter Kernel Size, Stride
Example
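A minimal sketch of 2×2 max pooling at stride 2 in torch (assumed here for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 3., 2., 1.],
                    [4., 2., 0., 1.],
                    [5., 6., 1., 2.],
                    [7., 8., 3., 4.]]]])      # (batch, channels, 4, 4)

y = F.max_pool2d(x, kernel_size=2, stride=2)  # pick the largest value in each 2x2 block
print(y)  # [[[[4., 2.], [8., 4.]]]]  -> resolution reduced from 4x4 to 2x2
```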
Average Pooling
- Take the average of each rectangular neighborhood; a related choice is the L2 norm of the neighborhood
- Weighted average based on distance from the center pixel
Backpropagation in Pooling Layer
TBD
3. Convolution Layer and Weight Sharing
Feature maps: generated by applying kernels to the input
Multiple kernels ⇒ multiple feature maps (see the sketch below)
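A minimal sketch in torch (assumed): one convolution layer whose kernels are shared across all spatial locations, with each of its 16 kernels producing one feature map.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)

x = torch.randn(8, 3, 32, 32)  # batch of 8 RGB images
feature_maps = conv(x)         # each kernel is slid over every location of the input
print(feature_maps.shape)      # torch.Size([8, 16, 32, 32]) -> 16 feature maps per image

# Weight sharing: only 16 * 3 * 5 * 5 weights (+ 16 biases), regardless of image size
print(sum(p.numel() for p in conv.parameters()))  # 1216
```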
Full Convolutional Neural Network Architecture
Each feature map can connect to any of the previous layer's feature maps
- Which connections exist is specified by a connection matrix
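A minimal sketch of a full conv → pool → conv → pool → fully-connected architecture in torch (the layer sizes are illustrative, not taken from the lecture):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, padding=2),   # 8 kernels -> 8 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(8, 16, kernel_size=5, padding=2),  # 16 kernels over the 8 maps
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(16 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

print(SmallCNN()(torch.randn(4, 1, 28, 28)).shape)  # torch.Size([4, 10])
```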
Analysis
CNNs are good at one-to-one classification (one fixed-size input → one output label)
Other Types of Classification
one-to-many: e.g. caption generation (a single input → a sequence of words)
many-to-one: sentiment analysis
many-to-many: e.g. translation; or, given a video with a variable number of frames, classifying each frame
Recurrent Neural Networks
Pros
- Able to handle variable size inputs & outputs
- Able to handle sequential data
Intuition
Sequentially read from left to right
Maintain an internal memory state
- Captures data seen so far
- Updated with new information
Implementation: Recurrent Relation
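A sketch of the usual (vanilla) form of the recurrence, where W, U, V, b are learned parameters (standard formulation, not necessarily the exact notation from the lecture):

h_t = tanh(W · h_(t−1) + U · x_t + b)
y_t = V · h_t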
Recurrent Unit
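A minimal sketch of such a unit in torch (assumed here; the class name and sizes are illustrative, and the ih / hh naming mirrors the shape table at the end of these notes):

```python
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.ih = nn.Linear(input_size, hidden_size)   # input-to-hidden: U x_t
        self.hh = nn.Linear(hidden_size, hidden_size)  # hidden-to-hidden: W h_{t-1}

    def forward(self, x_t, h_prev):
        return torch.tanh(self.ih(x_t) + self.hh(h_prev))

cell = VanillaRNNCell(input_size=10, hidden_size=32)
h = torch.zeros(20, 32)                  # internal memory state: batch * hidden
for x_t in torch.randn(30, 20, 10):      # read the sequence left to right
    h = cell(x_t, h)                     # update the state with new information
print(h.shape)                           # torch.Size([20, 32])
```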
Training
Back Propagation Through Time: Backward Pass
- Take the average of the gradients computed at each time step (the same weights are used at every step) to update the weights (see the sketch below)
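A minimal sketch of one BPTT training step in torch (assumed here; the shapes loosely follow the hyperparameters listed later in these notes). The mean loss over the window plays the role of averaging the per-step gradients, since the same weights are reused at every step.

```python
import torch
import torch.nn as nn

seq_length, batch_size, input_size, hidden_size, num_classes = 30, 20, 10, 32, 5
rnn = nn.RNN(input_size, hidden_size)    # the same weights are applied at every time step
head = nn.Linear(hidden_size, num_classes)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.002)
criterion = nn.CrossEntropyLoss()        # default reduction: mean over all predictions

x = torch.randn(seq_length, batch_size, input_size)
targets = torch.randint(num_classes, (seq_length, batch_size))

optimizer.zero_grad()
outputs, _ = rnn(x)                      # forward pass unrolled over the whole window
loss = criterion(head(outputs).view(-1, num_classes), targets.view(-1))
loss.backward()                          # gradients from every step accumulate on the shared weights
optimizer.step()
```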
LSTM (Long short-term memory)
Encode "long-term memory" in a cell's state to solve the vanishing gradient problem
Idea
RNNs try to keep track of arbitrarily long-term dependencies in the input sequence
⇒ back-propagating through many time steps leads to the vanishing/exploding gradient problem
- Vanishingly small gradients ⇒ further training effectively stops
- Explosively large gradients ⇒ the updates become unstable and training diverges
A plain RNN only passes the hidden state along the sequence, so it is not good at dealing with long sequences
Composition
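The usual composition of an LSTM cell, written out as a sketch (σ is the sigmoid, ⊙ is element-wise multiplication, W_* and b_* are learned parameters):

f_t = σ(W_f · [h_(t−1), x_t] + b_f)    (forget gate: what to erase from the cell state)
i_t = σ(W_i · [h_(t−1), x_t] + b_i)    (input gate: what new information to write)
c̃_t = tanh(W_c · [h_(t−1), x_t] + b_c) (candidate values)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ c̃_t        (cell state: the "long-term memory")
o_t = σ(W_o · [h_(t−1), x_t] + b_o)    (output gate)
h_t = o_t ⊙ tanh(c_t)                  (hidden state passed along the sequence)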
```python
embed_size = 32        # size of the input feature vector representing each word
hidden_size = 32       # number of hidden units in the LSTM cell
num_epochs = 1         # number of epochs for which you will train your model
num_samples = 200      # number of words to be sampled
batch_size = 20        # the size of your mini-batch
seq_length = 30        # the size of the BPTT window
learning_rate = 0.002  # learning rate of the model
```
h = batch * hidden
Lab6
Input = batch * ?? * input_size
Output = batch * output_size
Target = batch
|        | Lab6                    | HW4                |
| ------ | ----------------------- | ------------------ |
| Input  | batch * ?? * input_size | batch * input_size |
| Embed  |                         | input * hidden     |
| ih     | input * hidden          |                    |
| hh     | hidden * hidden         |                    |
| Output | batch * output_size     |                    |
| Target | batch                   | batch * input_size |