1 Backpropagation
In the previous post we introduced how to compute individual gradients. Building on that, this post explains how to apply the chain rule iteratively to compute the gradient of an expression, i.e. backpropagation. This is essential for understanding, developing, designing, and debugging Neural Networks.
The main problem this post addresses is the following.
The core problem of backpropagation: given a function $f(x)$, where $x$ is a vector of inputs, how do we compute the gradient of $f$ with respect to $x$, i.e. $\nabla f(x)$?
In the neural network setting, $f$ corresponds to the loss function and $x$ consists of the training data $(x_i, y_i), i = 1 \ldots N$, the weights $W$, and the biases $b$. Since the training data are fixed, we usually only compute the gradients with respect to $W$ and $b$ and use them to update these parameters so as to reduce the loss.
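As a minimal sketch of how these gradients drive the parameter update (the shapes and the step_size value below are illustrative assumptions, not from the original):

import numpy as np

# hypothetical parameter shapes, just to illustrate the update step
W = np.random.randn(10, 3072)
b = np.random.randn(10)

# dW and db would be produced by backpropagation; zeros are placeholders here
dW = np.zeros_like(W)
db = np.zeros_like(b)

step_size = 1e-3          # illustrative learning rate
W -= step_size * dW       # move the weights against the gradient
b -= step_size * db       # move the biases against the gradient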
1.1 Introduction
1.1.1 Simple Interpretation of the Gradient
Some simple gradients are summarized below.
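For example, for the add, multiply, and max gates:

$$f(x, y) = x + y \;\Rightarrow\; \frac{\partial f}{\partial x} = 1,\ \frac{\partial f}{\partial y} = 1$$

$$f(x, y) = xy \;\Rightarrow\; \frac{\partial f}{\partial x} = y,\ \frac{\partial f}{\partial y} = x$$

$$f(x, y) = \max(x, y) \;\Rightarrow\; \frac{\partial f}{\partial x} = \mathbb{1}(x \ge y),\ \frac{\partial f}{\partial y} = \mathbb{1}(y > x)$$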
1.1.2 Compound Expression, Chain Rule, Backpropagation
The gradient of a compound expression follows the chain rule.
(Figure: Chain Rule)
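As a small worked example of chaining local gradients, consider $f(x, y, z) = (x + y)\,z$ (the function and the values here are chosen purely for illustration):

# forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2, 5, -4
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass: apply the chain rule from the output back to the inputs
dfdq = z             # df/dq = z = -4
dfdz = q             # df/dz = q = 3
dfdx = 1.0 * dfdq    # dq/dx = 1, so df/dx = -4
dfdy = 1.0 * dfdq    # dq/dy = 1, so df/dy = -4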
1.2 Intuition
Backpropagation is an elegant local process. The basic unit is the gate. During the forward pass, each gate computes its output value $G_{out_i}$ and its local gradient $\frac{\partial G_{out_i}}{\partial G_{in_i}}$ from its known inputs $G_{in_i}$. During the backward pass, these previously computed local gradients are chained together by multiplication to obtain the gradient of the final output with respect to the initial input, $\frac{\partial G_{out_N}}{\partial G_{in_0}}$.
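A minimal sketch of this local view in code (the MultiplyGate class and its forward/backward interface are illustrative, not a fixed API):

class MultiplyGate:
    """A single gate: the forward pass computes the output from the inputs;
    the backward pass chains the upstream gradient with the local gradients."""
    def forward(self, a, b):
        self.a, self.b = a, b   # remember inputs for the backward pass
        return a * b
    def backward(self, dout):
        # local gradients: d(ab)/da = b, d(ab)/db = a, each chained with dout
        da = self.b * dout
        db = self.a * dout
        return da, db

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)   # forward pass: out = -12.0
da, db = gate.backward(1.0)     # backward pass: da = -4.0, db = 3.0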
1.3 Best Practice
1.3.1 Modularity: Sigmoid Example
We can modularize several gates into a single gate; here we use the sigmoid function as an example.
(Figure: Modular Gate)
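As a sketch, a 2D sigmoid neuron $f(w, x) = \sigma(w_0 x_0 + w_1 x_1 + w_2)$ can be backpropagated with the sigmoid treated as a single gate, using $\frac{d\sigma(x)}{dx} = (1 - \sigma(x))\,\sigma(x)$ (the weights and inputs below are made-up example values):

import math

w = [2, -3, -3]   # example weights (last entry is the bias)
x = [-1, -2]      # example inputs

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot))   # sigmoid activation

# backward pass through the neuron
ddot = (1 - f) * f               # local gradient of the sigmoid gate
dx = [w[0] * ddot, w[1] * ddot]              # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]  # backprop into w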
1.3.2 Staged Computation
The following expression is just an exercise in using intermediate (staged) values for backpropagation; the forward pass below computes $f(x, y) = \frac{x + \sigma(y)}{\sigma(x) + (x + y)^2}$, where $\sigma$ denotes the sigmoid function. Intermediate values greatly simplify the computation, without the need to work out the gradient of every single tiny gate.
import math

x = 3 # example values
y = -4
# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator #(1)
num = x + sigy # numerator #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y #(4)
xpysqr = xpy**2 #(5)
den = sigx + xpysqr # denominator #(6)
invden = 1.0 / den #(7)
f = num * invden # done! #(8)
# backprop f = num * invden
dnum = invden # gradient on numerator #(8)
dinvden = num #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden #(6)
dxpysqr = (1) * dden #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr #(5)
# backprop xpy = x + y
dx = (1) * dxpy #(4)
dy = (1) * dxpy #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! #(3)
# backprop num = x + sigy
dx += (1) * dnum #(2)
# This follows the multivariable chain rule in calculus: if a variable branches out to different parts of the circuit, the gradients that flow back into it add up.
dsigy = (1) * dnum #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy #(1)
# done! phew
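One way to sanity-check the analytic gradients above is a centered-difference numerical gradient (a verification sketch; f below simply re-implements the forward pass as one function):

import math

def f(x, y):
    sigy = 1.0 / (1 + math.exp(-y))
    sigx = 1.0 / (1 + math.exp(-x))
    return (x + sigy) / (sigx + (x + y)**2)

h = 1e-5
x, y = 3, -4
dx_num = (f(x + h, y) - f(x - h, y)) / (2 * h)   # should closely match dx
dy_num = (f(x, y + h) - f(x, y - h)) / (2 * h)   # should closely match dy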
1.4 Patterns in Backward Flow
(Figure: Gate Patterns)
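In brief, the usual patterns are: the add gate distributes the upstream gradient unchanged to all of its inputs, the max gate routes it entirely to the largest input, and the multiply gate gives each input the other input's value times the upstream gradient. A small sketch with made-up values:

dout = 2.0            # upstream gradient arriving at each gate
x, y = 3.0, -4.0      # example gate inputs

# add gate: distributes dout unchanged to both inputs
dx_add, dy_add = dout, dout

# max gate: routes dout to the larger input, 0 to the other
dx_max = dout if x >= y else 0.0
dy_max = dout if y > x else 0.0

# multiply gate: swaps the input values as local gradients
dx_mul, dy_mul = y * dout, x * dout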
1.5 Gradients for Vectorized Operations
So far we have only considered single variables, but the same concepts extend smoothly to multiple variables (matrices and vectors).
import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)
# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) #.T gives the transpose of the matrix
dX = W.T.dot(dD)
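A useful sanity check on vectorized gradients is dimension analysis: the gradient on a variable always has the same shape as the variable itself. Continuing the example above:

assert dW.shape == W.shape   # (5, 10)
assert dX.shape == X.shape   # (10, 3)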
2 Summary
(Figure: Linear Classification Summary)